Node Reference

Complete reference for all node types in soak pipelines.

Node Types

Split

Divide documents or text into smaller chunks.

Type: Split

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | "chunks" | Node name |
| inputs | List[str] | ["documents"] | Input nodes (max 1) |
| chunk_size | int | 20000 | Target chunk size |
| min_split | int | 500 | Minimum chunk size |
| overlap | int | 0 | Overlap between chunks (in units) |
| split_unit | str | "tokens" | Unit: "chars", "tokens", "words", "sentences", "paragraphs" |
| encoding_name | str | "cl100k_base" | Tokenizer for split_unit="tokens" |

Input: List of documents or TrackedItems
Output: List of text chunks (as TrackedItems with provenance)

Example:

- name: chunks
  type: Split
  chunk_size: 30000
  overlap: 500
  split_unit: tokens

Export:

01_Split_chunks/
├── inputs/
│   ├── 0000_doc_name.txt
│   └── 0000_doc_name_metadata.json
├── outputs/
│   ├── 0000_doc_name__chunks__0.txt
│   ├── 0000_doc_name__chunks__0_metadata.json
│   └── ...
├── split_summary.txt
└── meta.txt

Provenance:

Source IDs include node name:

  • Input: doc_A
  • Output: doc_A__chunks__0, doc_A__chunks__1, …
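The chunking and provenance scheme above can be sketched in plain Python. This is a simplified illustration only, assuming the split unit reduces to slicing a list; the real node also enforces min_split and supports several split units:

```python
def split_with_overlap(units, chunk_size, overlap, source_id, node_name):
    """Slice a list of units (tokens, words, ...) into overlapping chunks,
    assigning provenance IDs of the form <source>__<node>__<index>."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(units), step)):
        piece = units[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"id": f"{source_id}__{node_name}__{i}", "units": piece})
        if start + chunk_size >= len(units):
            break
    return chunks
```

With overlap > 0 the tail of each chunk is repeated at the head of the next, which helps avoid cutting codes or quotes at chunk boundaries.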

Map

Apply an LLM prompt to each item independently in parallel.

Type: Map

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes |
| template | str | Required | Jinja2 + struckdown template |
| model_name | str | From config | LLM model |
| max_tokens | int | 4096 | Max response tokens |
| temperature | float | 0.7 | LLM temperature |

Input: List of items
Output: List of ChatterResult objects (one per input item)

Template Access:

  • `` - Current item content
  • `` - Source tracking ID
  • `` - Item metadata
  • Any context variables from pipeline

Example:

- name: summaries
  type: Map
  max_tokens: 8000
  inputs:
    - chunks

---#summaries
Summarize this text in 2-3 sentences:



[[summary]]

Export:

02_Map_summaries/
├── inputs/
│   ├── 0000_doc__chunks__0.txt
│   └── ...
├── 0000_doc__chunks__0_prompt.md
├── 0000_doc__chunks__0_response.json
└── ...

Classifier

Extract structured data from each item using multiple choice and typed fields.

Type: Classifier

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes |
| template | str | Required | Template with structured outputs |
| model_name | str | From config | Single model name |
| model_names | List[str] | None | Multiple models for agreement analysis |
| agreement_fields | List[str] | None | Fields to calculate agreement on |
| max_tokens | int | 4096 | Max response tokens |

Input: List of items
Output: List of dictionaries with extracted fields

Template Syntax:

  • [[pick:field|opt1,opt2,opt3]] - Single choice
  • [[pick*:field|opt1,opt2]] - Multiple choice
  • [[int:field]] - Integer
  • [[boolean:field]] - True/False
  • [[text:field]] - Free text
  • ¡OBLIVIATE - Clear context between questions

Example:

- name: classify
  type: Classifier
  model_names:
    - gpt-4o-mini
    - gpt-4.1-mini
  agreement_fields:
    - topic
    - sentiment
  inputs:
    - chunks

---#classify
Classify this text:



What is the topic?
[[pick:topic|health,tech,education,other]]

¡OBLIVIATE

What is the sentiment?
[[pick:sentiment|positive,negative,neutral]]

Multi-model Agreement:

When model_names has 2+ models:

  • Each model classifies independently
  • Agreement statistics calculated (Gwet’s AC1, Krippendorff’s Alpha, % agreement)
  • Results include per-model classifications and statistics
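Raw percent agreement, the simplest of the reported statistics, can be sketched as below; Gwet's AC1 and Krippendorff's Alpha additionally correct for chance agreement. This is an illustration, not soak's implementation:

```python
def percent_agreement(ratings):
    """ratings maps model name -> list of labels, one label per item.
    Returns the fraction of items on which every model agrees."""
    per_item = list(zip(*ratings.values()))
    if not per_item:
        return 1.0
    unanimous = sum(1 for labels in per_item if len(set(labels)) == 1)
    return unanimous / len(per_item)
```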

Export:

03_Classifier_classify/
├── inputs/
├── classifications.csv          # Main output with source tracking
├── classifications.json
├── summary.txt                  # Field distributions
├── prompt_template.sd.md
├── agreement_stats.json         # If multi-model
├── human_rating_template.txt    # Template for human raters
└── 0000_*_response.json         # Per-item responses

CSV Format:

index,source_id,doc_index,original_file,topic,sentiment
0,doc__chunks__0,0,data/doc.txt,health,positive
1,doc__chunks__1,0,data/doc.txt,tech,neutral

Reduce

Concatenate multiple items into a single text.

Type: Reduce

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes (max 1) |
| template | str | "\n" | Template for each item |

Input: List of items
Output: Single concatenated string

Example:

- name: all_codes
  type: Reduce
  inputs:
    - chunk_codes

---#all_codes

Extracts .codes field from each ChatterResult and concatenates.
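The reduction amounts to field extraction plus joining, as in this sketch (field access shown on plain dicts; the real node reads ChatterResult attributes and applies the item template):

```python
def reduce_concat(results, field="codes", sep="\n"):
    """Pull one field from each result and join the pieces into one string."""
    return sep.join(str(r[field]) for r in results)
```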

Export:

04_Reduce_all_codes/
├── inputs/
├── result.txt
└── meta.txt

Transform

Apply an LLM prompt to a single input item (often the output of Reduce).

Type: Transform

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes |
| template | str | Required | Jinja2 + struckdown template |
| model_name | str | From config | LLM model |
| max_tokens | int | 4096 | Max response tokens |
| temperature | float | 0.7 | LLM temperature |

Input: A single item per input node (each input must resolve to exactly one item)
Output: ChatterResult

Example:

- name: codes
  type: Transform
  max_tokens: 32000
  inputs:
    - all_codes
    - all_themes

---#codes
Consolidate these preliminary codes:



And these themes:



[[codenotes]]

[[code*:codes]]

Multiple Inputs:

When a node has multiple inputs, all of them are available in the template context:

inputs:
  - all_codes
  - all_themes

# Template can access:


Export:

05_Transform_codes/
├── inputs/
├── prompt.md
├── response.json
├── result.json
└── meta.txt

VerifyQuotes

Unified quote verification node that can:

  • Extract quotes from Codes OR Themes
  • Search in documents OR any custom node output
  • Verify quote existence (BM25 + embeddings + LLM-as-judge)
  • Optionally verify fairness of quote usage (for themes)

Type: VerifyQuotes

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | "checkquotes" | Node name |
| quotes_from | str | required* | Node containing quotes (Codes or Themes) |
| search_in | str | null | Node to search in (null = documents) |
| check_fairness | bool | false | Enable fairness verification (themes only) |
| context_window_size | int | 1000 | Context window for fairness check |
| window_size | int | 300 | Window size for BM25 search |
| overlap | int | null | Window overlap (auto: 30% of window_size) |
| bm25_k1 | float | 1.5 | BM25 term frequency saturation |
| bm25_b | float | 0.4 | BM25 length normalization |
| ellipsis_max_gap | int | 3 | Max windows between ellipsis head/tail |
| trim_spans | bool | true | Enable span refinement |
| trim_method | str | "fuzzy" | Trimming: "fuzzy", "sliding_bm25", "hybrid" |
| min_fuzzy_ratio | float | 0.6 | Minimum fuzzy match quality threshold |
| expand_window_neighbors | int | 1 | Expand search to ±N windows if truncated |
| template | str | null | Custom LLM existence verification template |
| fairness_template | str | null | Custom fairness verification template |

* Note: For backward compatibility, inputs[0] is used if quotes_from is not specified.

Input: Codes OR Themes (containing quotes)
Output: Verification results with existence metrics and optional fairness verification

How It Works:

Stage 1: Existence Verification (BM25 + Embeddings)

  1. Extracts quotes from Codes or Themes
  2. Creates overlapping windows from search corpus
  3. Builds BM25 index over windows
  4. For each quote:
    • Finds best BM25 window (with ellipsis support)
    • Trims span to align with quote boundaries
    • Expands to neighbor windows if truncated
  5. Computes embedding similarity
  6. Tracks source document and positions

Stage 1.5: LLM Existence Check (for poor matches)

  • Runs LLM-as-judge on quotes with low BM25/cosine scores
  • Asks: “Is this quote contained in the source text?”
  • Returns explanation + boolean verification

Stage 2: Fairness Verification (optional, themes only)

  • If check_fairness=True and input is Themes:
    • Extracts context window around each quote
    • Presents LLM with: Theme + Code + Quote + Context
    • Asks: “Is this quote used fairly to support this theme?”
    • Returns explanation + boolean fairness judgment
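The windowing and BM25 stages (steps 2–4 above) can be sketched with a plain Okapi BM25 scorer using the node's default k1=1.5 and b=0.4. This is an illustration only; the real node also handles ellipses, span trimming, and neighbor expansion:

```python
import math
import re

def make_windows(text, size, overlap):
    """Overlapping character windows over the search corpus."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def tokenize(s):
    return re.findall(r"[a-z0-9]+", s.lower())

def best_window(quote, windows, k1=1.5, b=0.4):
    """Index of the window with the highest Okapi BM25 score for the quote."""
    docs = [tokenize(w) for w in windows]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    def score(d):
        total = 0.0
        for t in tokenize(quote):
            if t not in df:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            tf = d.count(t)
            total += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        return total
    return max(range(n), key=lambda i: score(docs[i]))
```

The idf term down-weights words common to many windows, so distinctive quote vocabulary dominates the match.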

Examples:

Verify Code quotes (backward compatible):

- name: checkquotes
  type: VerifyQuotes
  quotes_from: codes  # or use old 'inputs: [codes]'
  window_size: 450

Verify Theme quotes with fairness checking:

- name: verify_themes
  type: VerifyQuotes
  quotes_from: themes
  check_fairness: true
  context_window_size: 1000

Search in custom corpus:

- name: verify_in_summaries
  type: VerifyQuotes
  quotes_from: codes
  search_in: summaries  # Search in 'summaries' node output instead of documents

Export:

07_VerifyQuotes_checkquotes/
├── quote_verification.xlsx           # Formatted Excel (sorted by fairness/confidence)
├── stats.csv                          # Aggregate statistics
├── info.txt                           # Algorithm description
├── meta.txt
├── llm_existence_checks/              # LLM prompts/responses for poor matches
│   ├── 0000_{hash}_prompt.md
│   ├── 0000_{hash}_response.txt
│   └── 0000_{hash}_response.json
└── llm_fairness_checks/               # LLM prompts/responses (themes only)
    ├── 0000_{hash}_prompt.md
    ├── 0000_{hash}_response.txt
    └── 0000_{hash}_response.json

Output Metrics:

For all quotes:

  • bm25_score: Lexical relevance score
  • bm25_ratio: Match uniqueness (top1/top2)
  • cosine_similarity: Embedding similarity (0-1)
  • match_ratio: Fuzzy alignment quality (if trimming enabled)
  • source_doc: Source document name
  • global_start, global_end: Character positions
  • span_text: Matched text from source
  • llm_explanation: LLM explanation (poor matches only)
  • llm_is_contained: Boolean existence verification (poor matches only)

Additionally for themes (if check_fairness=True):

  • theme: Theme name
  • theme_description: Theme description
  • code_name: Code name
  • code_description: Code description
  • llm_fairness_explanation: LLM explanation for fairness
  • llm_is_fair: Boolean fairness judgment

Interpreting Results:

Codes:

| BM25 Score | BM25 Ratio | Cosine Sim | LLM Contained | Interpretation |
|------------|------------|------------|---------------|----------------|
| High | High | ~1.0 | N/A | ✓ Perfect verbatim match |
| High | High | >0.9 | N/A | ✓ Near-exact (minor edits) |
| Low | Low | >0.85 | True | ⚠ Poor match but LLM confirms |
| Low | Low | <0.7 | False | ✗ Likely hallucination |

Themes:

| Existence | Fairness | Interpretation |
|-----------|----------|----------------|
| High BM25/Cosine | True | ✓ Quote exists and supports theme |
| High BM25/Cosine | False | ⚠ Quote exists but taken out of context |
| Low BM25/Cosine | True | ⚠ Poor match but fair usage |
| Low BM25/Cosine | False | ✗ Hallucinated or misused |


Cluster

Group items by semantic similarity using density-based clustering.

Type: Cluster

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes |
| items_field | str | "codes" | Field to extract items from (null for TrackedItems) |
| text_field | str | "content" | How to extract text for embedding |
| method | ClusterMethod | HDBSCAN defaults | Clustering method configuration |
| skip_below | int | 20 | Skip clustering if input count is below this threshold |
| if_skipped_bypass_to | str | null | When skip_below triggers, bypass intermediate nodes to this target |

HDBSCAN Method Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | "hdbscan" | Method identifier |
| min_cluster_size_proportion | float | null | Target cluster size as proportion (e.g., 0.25 = ~4 clusters) |
| min_cluster_size | int | 2 | Hard floor – clusters never smaller than this |
| max_cluster_size | int | 100 | Hard ceiling – clusters split if larger (null = no limit) |
| min_samples | int | 1 | HDBSCAN min_samples parameter |

Size Control Logic:

  • min_cluster_size_proportion is a goal – suggests cluster size based on total items
  • min_cluster_size is a hard floor – overrides proportion if higher
  • max_cluster_size is a hard ceiling – clusters exceeding this are split

Example with 100 items, proportion=0.25, min=10, max=50:

  • Proportion suggests min_cluster_size=25 (100 × 0.25)
  • Floor of 10 doesn’t apply (25 > 10)
  • Effective min_cluster_size = 25
  • Clusters larger than 50 get recursively split
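The floor-versus-proportion rule reduces to a one-line computation, sketched here (assuming simple rounding; soak's exact rounding may differ):

```python
def effective_min_cluster_size(n_items, proportion=None, floor=2):
    """Proportion suggests a size; the hard floor wins if it is larger."""
    suggested = round(n_items * proportion) if proportion else 0
    return max(floor, suggested)
```

With 100 items, proportion=0.25, and floor=10 this yields 25, matching the worked example above.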

Input: List of items (Codes, TrackedItems, or any objects)
Output: List of TrackedItems, each representing a cluster

How It Works:

  1. Extract items using items_field (e.g., extract Code objects from CodeList)
  2. Extract text using text_field (content, metadata field, or Jinja2 template)
  3. Compute embeddings for all unique texts
  4. Run HDBSCAN with calculated effective min_cluster_size
  5. Handle oversized clusters by recursive splitting (respects max_cluster_size)
  6. Group singletons (noise points) into batches (see below)
  7. Return clusters as TrackedItems with original items in metadata

Singleton Handling:

HDBSCAN assigns items that don’t fit well into any cluster as “noise points” (label -1). These singletons are not discarded or placed into an “others” cluster with special naming – instead they are batched together into regular clusters:

  • Singletons are grouped into batches of up to max_cluster_size
  • If max_cluster_size is null, all singletons go into a single batch
  • These batches appear as normal clusters (e.g., cluster_12, cluster_13) in the output
  • There is no metadata flag distinguishing singleton-derived clusters from coherent clusters

The only place singleton counts are visible is in the export summary (cluster_summary.txt):

Processing stats:
  Singletons (noise points): 42

This means if you have 150 singletons and max_cluster_size=100, you get two additional clusters containing 100 and 50 items respectively, indistinguishable from semantically coherent clusters in downstream processing.
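The batching behaviour described above amounts to simple slicing, as in this sketch:

```python
def batch_singletons(singletons, max_cluster_size=None):
    """Group noise points into regular-looking clusters of at most
    max_cluster_size items (one big batch when no limit is set)."""
    if not singletons:
        return []
    if max_cluster_size is None:
        return [singletons]
    return [singletons[i:i + max_cluster_size]
            for i in range(0, len(singletons), max_cluster_size)]
```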

text_field Options:

  • "content" (default): Uses TrackedItem.content or str(item)
  • "metadata.field_name": Extracts from item.metadata[“field_name”]
  • ": ": Jinja2 template for custom text

Examples:

Cluster codes by name and description:

- name: grouped_codes
  type: Cluster
  inputs: [coded_chunks]
  items_field: codes
  text_field: ": "
  method:
    name: hdbscan
    min_cluster_size_proportion: 0.25
    min_cluster_size: 5
    max_cluster_size: 30

Cluster text chunks directly:

- name: grouped_chunks
  type: Cluster
  inputs: [chunks]
  items_field: null  # items are TrackedItems, not containers
  text_field: content

Large clusters for broad themes:

- name: broad_groups
  type: Cluster
  inputs: [all_codes]
  items_field: codes
  method:
    name: hdbscan
    min_cluster_size_proportion: 0.25  # ~4 groups
    max_cluster_size: null             # no upper limit

Accessing Cluster Results:

Each output cluster is a TrackedItem with:

  • content: Stringified cluster items (joined by ---)
  • metadata.items: Original items (Code objects, etc.)
  • metadata.cluster_id: e.g., “cluster_0”
  • metadata.cluster_size: Number of items

In downstream templates:


Export:

03_Cluster_grouped_codes/
├── cluster_summary.txt           # Statistics and per-cluster sizes
├── cluster_0_content.txt         # Stringified cluster content
├── cluster_1_content.txt
├── ...
├── outputs/
│   ├── cluster_0/
│   │   ├── 0000_code-slug.txt    # Individual items
│   │   └── 0001_code-slug.txt
│   ├── cluster_1/
│   │   └── ...
└── meta.txt

cluster_summary.txt contents:

Cluster Summary
===============
Total items: 345
Number of clusters: 18
Cluster size min: 5
Cluster size max: 50
Cluster size mean: 19.2

Method: hdbscan
max_cluster_size: 50
min_cluster_size: 10
min_cluster_size_proportion: 0.25
min_samples: 1
effective_min_cluster_size: 86

Processing stats:
  Singletons (noise points): 12
  Oversized clusters split: 2

Per-cluster sizes:
  cluster_47: 50
  cluster_12: 48
  cluster_3: 32
  ...

Common Patterns:

Code → Cluster → Consolidate:

nodes:
  - name: chunk_codes
    type: Map
    inputs: [chunks]

  - name: grouped_codes
    type: Cluster
    inputs: [chunk_codes]
    items_field: codes

  - name: themes
    type: Map
    inputs: [grouped_codes]
    # Each cluster becomes input for theme generation

Hierarchical clustering:

nodes:
  # First pass: many small clusters
  - name: fine_clusters
    type: Cluster
    method:
      name: hdbscan
      min_cluster_size: 3
      max_cluster_size: 10

  # Consolidate each cluster
  - name: consolidated
    type: Map
    inputs: [fine_clusters]

  # Second pass: fewer large clusters
  - name: broad_clusters
    type: Cluster
    inputs: [consolidated]
    method:
      name: hdbscan
      min_cluster_size_proportion: 0.2
      max_cluster_size: null

Conditional Bypass for Small Datasets:

When processing small datasets, clustering and consolidation may be unnecessary overhead. Use if_skipped_bypass_to to bypass intermediate nodes when input is below the skip_below threshold:

nodes:
  - name: codes_from_chunks
    type: Map
    inputs: [chunks]

  - name: grouped_codes
    type: Cluster
    inputs: [codes_from_chunks]
    items_field: codes
    skip_below: 20
    if_skipped_bypass_to: final_codes  # bypass consolidation

  - name: consolidated_codes
    type: Map
    inputs: [grouped_codes]

  - name: final_codes
    type: Reduce
    inputs: [consolidated_codes]
    items_field: codes

Behaviour:

When input has fewer than 20 codes:

  • grouped_codes passes its input through unchanged (no clustering)
  • consolidated_codes is automatically skipped
  • codes_from_chunks output flows directly to final_codes
  • No LLM calls for clustering or consolidation

When input has 20 or more codes:

  • Normal clustering and consolidation flow executes
  • All intermediate nodes run as configured

This is useful for pipelines that handle both small pilot studies and large datasets with the same configuration.
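The bypass decision itself reduces to a size check, sketched below (the actual node wiring in soak, which also skips the bypassed downstream nodes, is more involved):

```python
def run_cluster_node(items, skip_below, cluster_fn):
    """Return (output, skipped): pass the input through unchanged when it
    is too small to be worth clustering, otherwise cluster normally."""
    if len(items) < skip_below:
        return items, True
    return cluster_fn(items), False
```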

Batch

Group items into batches for processing.

Type: Batch

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes |
| batch_size | int | 10 | Items per batch |

Input: List of items
Output: BatchList (list of lists)

Example:

- name: batched_chunks
  type: Batch
  batch_size: 5
  inputs:
    - chunks

Used with Reduce to process batches:

- name: batch_summaries
  type: Reduce
  inputs:
    - batched_chunks

---#batch_summaries
Summarize these  chunks together:

GroupBy

Group items by one or more field values, creating nested batch structures.

Type: GroupBy

Parameters:

Parameter Type Default Description
name str Required Node name
inputs List[str] Required Input nodes
group_by List[str] Required Field names to group by

Input: List of items (TrackedItem or dict with metadata/outputs)
Output: BatchList (nested if multiple group_by fields)

How It Works:

  • Groups items by values in specified fields
  • Fields can be from metadata, outputs, or ChatterResult attributes
  • Multiple fields create nested BatchLists
  • Each batch contains items sharing same field values

Single Field Example:

- name: by_category
  type: GroupBy
  group_by:
    - category
  inputs:
    - classified_items

If items have categories ["health", "tech", "health"], this creates 2 batches:

  • Batch 1: Items with category="health"
  • Batch 2: Items with category="tech"

Multi-Field Example:

- name: by_category_and_sentiment
  type: GroupBy
  group_by:
    - category
    - sentiment
  inputs:
    - classified_items

Creates nested structure:

  • health → positive → [items]
  • health → negative → [items]
  • tech → positive → [items]
  • tech → neutral → [items]
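The nested grouping can be sketched with plain dicts (illustration only; the real node returns nested BatchLists and can also read fields from outputs or ChatterResult attributes):

```python
def group_by_fields(items, fields):
    """Recursively group items (dicts here) by successive field values."""
    if not fields:
        return items
    head, rest = fields[0], fields[1:]
    groups = {}
    for item in items:
        groups.setdefault(item[head], []).append(item)
    return {key: group_by_fields(members, rest) for key, members in groups.items()}
```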

Use Cases:

  • Process items differently based on classification
  • Analyze patterns within categories
  • Create hierarchical groupings
  • Prepare for category-specific transformations

Ungroup

Flatten all BatchList nesting levels, returning a flat list.

Type: Ungroup

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes (must be BatchList) |

Input: BatchList (any nesting level)
Output: Flat list of items

How It Works:

  • Recursively flattens all batch nesting
  • Preserves original item order
  • Removes all grouping structure

Example:

- name: flattened
  type: Ungroup
  inputs:
    - grouped_items

Converts nested structure:

[
  [item1, item2],      # Batch 1
  [item3],             # Batch 2
  [[item4], [item5]]   # Nested batches
]

To flat list:

[item1, item2, item3, item4, item5]
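The flattening above is a straightforward recursion, as in this sketch:

```python
def ungroup(batches):
    """Recursively flatten nested batch lists, preserving item order."""
    flat = []
    for entry in batches:
        if isinstance(entry, list):
            flat.extend(ungroup(entry))
        else:
            flat.append(entry)
    return flat
```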

Use Cases:

  • Remove grouping after category-specific processing
  • Prepare batched results for non-batch-aware nodes
  • Flatten before final output
  • Combine results from multiple batch levels

Filter

Filter items based on a boolean expression, either directly on item data (no LLM call) or on fields extracted by an optional LLM step.

Type: Filter

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | str | Required | Node name |
| inputs | List[str] | Required | Input nodes (must produce list) |
| expression | str | Required | Python expression using simpleeval |
| template | str | null | Optional LLM template (enables LLM mode) |

Input: List of items (TrackedItem or ChatterResult)
Output: Filtered list (items where expression is truthy)

Filter Modes:

  1. LLM Mode - Run template through LLM, filter on extracted fields
  2. Simple Mode - Filter directly on item data

The mode is auto-detected: if a template is provided, LLM mode is used; otherwise the expression is evaluated directly against each item.

LLM Mode Example:

- name: filtered
  type: Filter
  template: "Is this relevant? [[bool:is_relevant]]"
  expression: "is_relevant == True"
  inputs: [chunks]

Simple Mode Example:

- name: long_chunks
  type: Filter
  expression: "len(input) > 100"
  inputs: [chunks]

Common Expressions:

# Boolean response (ChatterResult)
"item['decision_node'].response is True"

# Numeric threshold from outputs
"item['score_node'].outputs['score'] > 0.5"

# Multiple conditions
"item['category_node'].outputs['category'] == 'relevant' and item['score_node'].outputs['score'] > 0.3"

Use Case:

Filtering items on decisions or scores that earlier LLM nodes already produced, without issuing additional LLM calls.
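Simple-mode evaluation can be approximated with a restricted eval, as in this stand-in (soak uses simpleeval; this sketch is for illustration only and is not safe for untrusted expressions):

```python
def filter_items(items, expression):
    """Keep items for which the expression evaluates truthy.
    Illustration only: real Filter nodes evaluate with simpleeval."""
    allowed_globals = {"__builtins__": {}, "len": len}
    kept = []
    for item in items:
        names = {"input": item, "item": item}
        if eval(expression, allowed_globals, names):  # sketch; not for untrusted input
            kept.append(item)
    return kept
```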

Common Patterns

Parallel Processing

nodes:
  - name: chunks
    type: Split

  - name: process_chunks
    type: Map          # Processes all chunks in parallel
    inputs: [chunks]

Collect and Consolidate

nodes:
  - name: chunk_codes
    type: Map

  - name: all_codes
    type: Reduce       # Concatenate all outputs
    inputs: [chunk_codes]

  - name: final_codes
    type: Transform    # Consolidate into final result
    inputs: [all_codes]

Multi-input Transform

nodes:
  - name: codes
    type: Transform

  - name: themes
    type: Transform
    inputs: [codes]    # Uses codes output

  - name: narrative
    type: Transform
    inputs:
      - codes          # Access both in template
      - themes

Classification Pipeline

nodes:
  - name: chunks
    type: Split

  - name: classify
    type: Classifier
    inputs: [chunks]

Nested Splits

nodes:
  - name: chapters
    type: Split
    chunk_size: 50000

  - name: paragraphs
    type: Split
    chunk_size: 5000
    inputs: [chapters]    # Split the splits

Provenance: book__chapters__0__paragraphs__2

Template Reference

Available Variables

In all nodes:

  • Pipeline default_context variables
  • All previous node results (by node name)

In ItemsNode (Map, Classifier, Transform):

  • `` - Current item content
  • `` - Provenance ID
  • `` - Item metadata dict
  • `` - Full TrackedItem object

In Reduce:

  • `` - Each item being reduced
  • Named node variables (e.g., ``)

Jinja2 Features

Conditionals:


Loops:


Filters:




Struckdown Syntax

Return types for thematic analysis:

[[code*:codes]]          # List[Code]
[[theme*:themes]]        # List[Theme]
[[extract:text]]         # Free text
[[report]]               # Free text (narrative)

Return types for classification:

[[pick:field|a,b,c]]     # Single choice
[[pick*:field|a,b]]      # Multiple choice (list)
[[int:field]]            # Integer
[[boolean:field]]        # True/False
[[text:field]]           # Free text string

Context control:

¡BEGIN                   # Start new context
¡OBLIVIATE               # Clear context between questions

Node Configuration

Global Config

Set in pipeline front matter:

config:
  model_name: openai/gpt-4.1-mini
  llm_credentials:
    api_key: ${LLM_API_KEY}
    base_url: ${LLM_API_BASE}

Per-node Overrides

- name: detailed_analysis
  type: Map
  model_name: openai/gpt-4o     # Override for this node
  max_tokens: 16000
  temperature: 0.3
  inputs: [chunks]
