Node Reference
Complete reference for all node types in soak pipelines.
Node Types
Split
Divide documents or text into smaller chunks.
Type: Split
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | "chunks" | Node name |
inputs | List[str] | ["documents"] | Input nodes (max 1) |
chunk_size | int | 20000 | Target chunk size |
min_split | int | 500 | Minimum chunk size |
overlap | int | 0 | Overlap between chunks (in units) |
split_unit | str | "tokens" | Unit: "chars", "tokens", "words", "sentences", "paragraphs" |
encoding_name | str | "cl100k_base" | Tokenizer for split_unit="tokens" |
Input: List of documents or TrackedItems
Output: List of text chunks (as TrackedItems with provenance)
Example:
```yaml
- name: chunks
  type: Split
  chunk_size: 30000
  overlap: 500
  split_unit: tokens
```
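The overlap semantics can be sketched in plain Python; this is a hypothetical word-based helper for illustration, not the library's actual splitter:

```python
def split_with_overlap(text, chunk_size, overlap):
    # Each chunk starts (chunk_size - overlap) units after the previous one,
    # so consecutive chunks share `overlap` units
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = split_with_overlap("one two three four five six", chunk_size=4, overlap=2)
# chunks[0] and chunks[1] share two words; a trailing chunk shorter than
# min_split would be merged or dropped by the real node
```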
Export:
```
01_Split_chunks/
├── inputs/
│   ├── 0000_doc_name.txt
│   └── 0000_doc_name_metadata.json
├── outputs/
│   ├── 0000_doc_name__chunks__0.txt
│   ├── 0000_doc_name__chunks__0_metadata.json
│   └── ...
├── split_summary.txt
└── meta.txt
```
Provenance:
Source IDs include node name:
- Input: `doc_A`
- Output: `doc_A__chunks__0`, `doc_A__chunks__1`, …
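The ID scheme can be sketched as follows (hypothetical helper name):

```python
def child_ids(source_id, node_name, n):
    # Each derived item appends the producing node's name and an index
    # to its parent's source ID
    return [f"{source_id}__{node_name}__{i}" for i in range(n)]

child_ids("doc_A", "chunks", 2)
# → ['doc_A__chunks__0', 'doc_A__chunks__1']
```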
Map
Apply an LLM prompt to each item independently in parallel.
Type: Map
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
template | str | Required | Jinja2 + struckdown template |
model_name | str | From config | LLM model |
max_tokens | int | 4096 | Max response tokens |
temperature | float | 0.7 | LLM temperature |
Input: List of items
Output: List of ChatterResult objects (one per input item)
Template Access:
- `` - Current item content
- `` - Source tracking ID
- `` - Item metadata
- Any context variables from pipeline
Example:
```yaml
- name: summaries
  type: Map
  max_tokens: 8000
  inputs:
    - chunks
---#summaries
Summarize this text in 2-3 sentences:

[[summary]]
```
Export:
```
02_Map_summaries/
├── inputs/
│   ├── 0000_doc__chunks__0.txt
│   └── ...
├── 0000_doc__chunks__0_prompt.md
├── 0000_doc__chunks__0_response.json
└── ...
```
Classifier
Extract structured data from each item using multiple choice and typed fields.
Type: Classifier
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
template | str | Required | Template with structured outputs |
model_name | str | From config | Single model name |
model_names | List[str] | None | Multiple models for agreement analysis |
agreement_fields | List[str] | None | Fields to calculate agreement on |
max_tokens | int | 4096 | Max response tokens |
Input: List of items
Output: List of dictionaries with extracted fields
Template Syntax:
- `[[pick:field|opt1,opt2,opt3]]` - Single choice
- `[[pick*:field|opt1,opt2]]` - Multiple choice
- `[[int:field]]` - Integer
- `[[boolean:field]]` - True/False
- `[[text:field]]` - Free text
- `¡OBLIVIATE` - Clear context between questions
Example:
```yaml
- name: classify
  type: Classifier
  model_names:
    - gpt-4o-mini
    - gpt-4.1-mini
  agreement_fields:
    - topic
    - sentiment
  inputs:
    - chunks
---#classify
Classify this text:

What is the topic?
[[pick:topic|health,tech,education,other]]

¡OBLIVIATE

What is the sentiment?
[[pick:sentiment|positive,negative,neutral]]
```
Multi-model Agreement:
When model_names has 2+ models:
- Each model classifies independently
- Agreement statistics calculated (Gwet’s AC1, Krippendorff’s Alpha, % agreement)
- Results include per-model classifications and statistics
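Raw percent agreement between two models can be sketched as below; Gwet's AC1 and Krippendorff's Alpha, which the node also reports, additionally correct for chance agreement (a sketch only, not the node's implementation):

```python
def percent_agreement(ratings_a, ratings_b):
    # Fraction of items where two raters (here: models) chose the same label
    assert len(ratings_a) == len(ratings_b)
    same = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return same / len(ratings_a)

percent_agreement(["health", "tech", "health", "other"],
                  ["health", "tech", "education", "other"])
# → 0.75
```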
Export:
```
03_Classifier_classify/
├── inputs/
├── classifications.csv        # Main output with source tracking
├── classifications.json
├── summary.txt                # Field distributions
├── prompt_template.sd.md
├── agreement_stats.json       # If multi-model
├── human_rating_template.txt  # Template for human raters
└── 0000_*_response.json       # Per-item responses
```
CSV Format:
```csv
index,source_id,doc_index,original_file,topic,sentiment
0,doc__chunks__0,0,data/doc.txt,health,positive
1,doc__chunks__1,0,data/doc.txt,tech,neutral
```
Reduce
Concatenate multiple items into a single text.
Type: Reduce
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes (max 1) |
template | str | "\n" | Template for each item |
Input: List of items
Output: Single concatenated string
Example:
```yaml
- name: all_codes
  type: Reduce
  inputs:
    - chunk_codes
---#all_codes
```

Extracts the `.codes` field from each ChatterResult and concatenates.
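A rough sketch of the concatenation step, assuming results expose a `.codes` attribute as in the example (toy objects stand in for ChatterResult):

```python
from types import SimpleNamespace

def reduce_concat(results, render=lambda r: str(r.codes)):
    # Render each item with the node's template, then join the rendered pieces
    return "\n".join(render(r) for r in results)

rs = [SimpleNamespace(codes="c1"), SimpleNamespace(codes="c2")]
reduce_concat(rs)  # → 'c1\nc2'
```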
Export:
```
04_Reduce_all_codes/
├── inputs/
├── result.txt
└── meta.txt
```
Transform
Apply LLM prompt to single input item (often the output of Reduce).
Type: Transform
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
template | str | Required | Jinja2 + struckdown template |
model_name | str | From config | LLM model |
max_tokens | int | 4096 | Max response tokens |
temperature | float | 0.7 | LLM temperature |
Input: Single item (asserts exactly one input)
Output: ChatterResult
Example:
```yaml
- name: codes
  type: Transform
  max_tokens: 32000
  inputs:
    - all_codes
    - all_themes
---#codes
Consolidate these preliminary codes:

And these themes:

[[codenotes]]
[[code*:codes]]
```
Multiple Inputs:
When a node has multiple inputs, all of them are available in the template context:

```yaml
inputs:
  - all_codes
  - all_themes
# Template can access:
```
Export:
```
05_Transform_codes/
├── inputs/
├── prompt.md
├── response.json
├── result.json
└── meta.txt
```
VerifyQuotes
Unified quote verification node that can:
- Extract quotes from Codes OR Themes
- Search in documents OR any custom node output
- Verify quote existence (BM25 + embeddings + LLM-as-judge)
- Optionally verify fairness of quote usage (for themes)
Type: VerifyQuotes
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | "checkquotes" | Node name |
quotes_from | str | required* | Node containing quotes (Codes or Themes) |
search_in | str | null | Node to search in (null = documents) |
check_fairness | bool | false | Enable fairness verification (themes only) |
context_window_size | int | 1000 | Context window for fairness check |
window_size | int | 300 | Window size for BM25 search |
overlap | int | null | Window overlap (auto: 30% of window_size) |
bm25_k1 | float | 1.5 | BM25 term frequency saturation |
bm25_b | float | 0.4 | BM25 length normalization |
ellipsis_max_gap | int | 3 | Max windows between ellipsis head/tail |
trim_spans | bool | true | Enable span refinement |
trim_method | str | "fuzzy" | Trimming: "fuzzy", "sliding_bm25", "hybrid" |
min_fuzzy_ratio | float | 0.6 | Minimum fuzzy match quality threshold |
expand_window_neighbors | int | 1 | Expand search to ±N windows if truncated |
template | str | null | Custom LLM existence verification template |
fairness_template | str | null | Custom fairness verification template |
* Note: For backward compatibility, inputs[0] is used if quotes_from is not specified.
Input: Codes OR Themes (containing quotes)
Output: Verification results with existence metrics and optional fairness verification
How it Works:
Stage 1: Existence Verification (BM25 + Embeddings)
- Extracts quotes from Codes or Themes
- Creates overlapping windows from search corpus
- Builds BM25 index over windows
- For each quote:
- Finds best BM25 window (with ellipsis support)
- Trims span to align with quote boundaries
- Expands to neighbor windows if truncated
- Computes embedding similarity
- Tracks source document and positions
Stage 1.5: LLM Existence Check (for poor matches)
- Runs LLM-as-judge on quotes with low BM25/cosine scores
- Asks: “Is this quote contained in the source text?”
- Returns explanation + boolean verification
Stage 2: Fairness Verification (optional, themes only)
- If `check_fairness=True` and input is Themes:
- Extracts context window around each quote
- Presents LLM with: Theme + Code + Quote + Context
- Asks: “Is this quote used fairly to support this theme?”
- Returns explanation + boolean fairness judgment
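The Stage 1 windowing can be sketched as follows; a naive shared-term count stands in for real BM25 scoring (the `k1`/`b` parameters are omitted), so treat this purely as an illustration:

```python
def make_windows(text, window_size=300, overlap=None):
    # Overlapping character windows; the default overlap mirrors the node's
    # auto setting of 30% of window_size
    overlap = int(window_size * 0.3) if overlap is None else overlap
    step = window_size - overlap
    return [text[i:i + window_size] for i in range(0, max(len(text) - overlap, 1), step)]

def best_window(quote, windows):
    # Toy lexical score: count of shared terms (the real node uses BM25)
    q = set(quote.lower().split())
    scores = [len(q & set(w.lower().split())) for w in windows]
    return max(range(len(windows)), key=scores.__getitem__)
```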
Examples:
Verify Code quotes (backward compatible):
```yaml
- name: checkquotes
  type: VerifyQuotes
  quotes_from: codes  # or use old 'inputs: [codes]'
  window_size: 450
```
Verify Theme quotes with fairness checking:
```yaml
- name: verify_themes
  type: VerifyQuotes
  quotes_from: themes
  check_fairness: true
  context_window_size: 1000
```
Search in custom corpus:
```yaml
- name: verify_in_summaries
  type: VerifyQuotes
  quotes_from: codes
  search_in: summaries  # Search in 'summaries' node output instead of documents
```
Export:
```
07_VerifyQuotes_checkquotes/
├── quote_verification.xlsx   # Formatted Excel (sorted by fairness/confidence)
├── stats.csv                 # Aggregate statistics
├── info.txt                  # Algorithm description
├── meta.txt
├── llm_existence_checks/     # LLM prompts/responses for poor matches
│   ├── 0000_{hash}_prompt.md
│   ├── 0000_{hash}_response.txt
│   └── 0000_{hash}_response.json
└── llm_fairness_checks/      # LLM prompts/responses (themes only)
    ├── 0000_{hash}_prompt.md
    ├── 0000_{hash}_response.txt
    └── 0000_{hash}_response.json
```
Output Metrics:
For all quotes:
- `bm25_score`: Lexical relevance score
- `bm25_ratio`: Match uniqueness (top1/top2)
- `cosine_similarity`: Embedding similarity (0-1)
- `match_ratio`: Fuzzy alignment quality (if trimming enabled)
- `source_doc`: Source document name
- `global_start`, `global_end`: Character positions
- `span_text`: Matched text from source
- `llm_explanation`: LLM explanation (poor matches only)
- `llm_is_contained`: Boolean existence verification (poor matches only)
Additionally for themes (if check_fairness=True):
- `theme`: Theme name
- `theme_description`: Theme description
- `code_name`: Code name
- `code_description`: Code description
- `llm_fairness_explanation`: LLM explanation for fairness
- `llm_is_fair`: Boolean fairness judgment
Interpreting Results:
Codes:

| BM25 Score | BM25 Ratio | Cosine Sim | LLM Contained | Interpretation |
|---|---|---|---|---|
| High | High | ~1.0 | N/A | ✓ Perfect verbatim match |
| High | High | >0.9 | N/A | ✓ Near-exact (minor edits) |
| Low | Low | >0.85 | True | ⚠ Poor match but LLM confirms |
| Low | Low | <0.7 | False | ✗ Likely hallucination |
Themes:

| Existence | Fairness | Interpretation |
|---|---|---|
| High BM25/Cosine | True | ✓ Quote exists and supports theme |
| High BM25/Cosine | False | ⚠ Quote exists but taken out of context |
| Low BM25/Cosine | True | ⚠ Poor match but fair usage |
| Low BM25/Cosine | False | ✗ Hallucinated or misused |
See Also:
- Quote Verification Algorithm - Detailed algorithm specification
- Quote Verification Approach - Design rationale
Cluster
Group items by semantic similarity using density-based clustering.
Type: Cluster
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
items_field | str | "codes" | Field to extract items from (null for TrackedItems) |
text_field | str | "content" | How to extract text for embedding |
method | ClusterMethod | HDBSCAN defaults | Clustering method configuration |
skip_below | int | 20 | Skip clustering if input count is below this threshold |
if_skipped_bypass_to | str | null | When skip_below triggers, bypass intermediate nodes to this target |
HDBSCAN Method Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | "hdbscan" | Method identifier |
min_cluster_size_proportion | float | null | Target cluster size as proportion (e.g., 0.25 = ~4 clusters) |
min_cluster_size | int | 2 | Hard floor – clusters never smaller than this |
max_cluster_size | int | 100 | Hard ceiling – clusters split if larger (null = no limit) |
min_samples | int | 1 | HDBSCAN min_samples parameter |
Size Control Logic:
- `min_cluster_size_proportion` is a goal – suggests cluster size based on total items
- `min_cluster_size` is a hard floor – overrides proportion if higher
- `max_cluster_size` is a hard ceiling – clusters exceeding this are split
Example with 100 items, proportion=0.25, min=10, max=50:
- Proportion suggests min_cluster_size=25 (100 × 0.25)
- Floor of 10 doesn’t apply (25 > 10)
- Effective min_cluster_size = 25
- Clusters larger than 50 get recursively split
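The arithmetic above can be sketched as follows (hypothetical helper, not the node's code):

```python
def effective_min_cluster_size(n_items, proportion=None, floor=2):
    # The proportion suggests a size; the hard floor overrides it if higher
    suggested = round(n_items * proportion) if proportion else floor
    return max(suggested, floor)

effective_min_cluster_size(100, proportion=0.25, floor=10)  # → 25
effective_min_cluster_size(20, proportion=0.25, floor=10)   # → 10 (floor wins)
```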
Input: List of items (Codes, TrackedItems, or any objects)
Output: List of TrackedItems, each representing a cluster
How It Works:
1. Extract items using `items_field` (e.g., extract Code objects from CodeList)
2. Extract text using `text_field` (content, metadata field, or Jinja2 template)
3. Compute embeddings for all unique texts
4. Run HDBSCAN with calculated effective min_cluster_size
5. Handle oversized clusters by recursive splitting (respects max_cluster_size)
6. Group singletons (noise points) into batches (see below)
7. Return clusters as TrackedItems with original items in metadata
Singleton Handling:
HDBSCAN assigns items that don’t fit well into any cluster as “noise points” (label -1). These singletons are not discarded or placed into an “others” cluster with special naming – instead they are batched together into regular clusters:
- Singletons are grouped into batches of up to `max_cluster_size`
- If `max_cluster_size` is null, all singletons go into a single batch
- These batches appear as normal clusters (e.g., `cluster_12`, `cluster_13`) in the output
- There is no metadata flag distinguishing singleton-derived clusters from coherent clusters
The only place singleton counts are visible is in the export summary (cluster_summary.txt):
```
Processing stats:
  Singletons (noise points): 42
```
This means if you have 150 singletons and max_cluster_size=100, you get two additional clusters containing 100 and 50 items respectively, indistinguishable from semantically coherent clusters in downstream processing.
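The batching rule can be sketched as follows (hypothetical helper):

```python
def batch_singletons(singletons, max_cluster_size=None):
    # Group noise points into ordinary clusters of at most max_cluster_size;
    # with no limit, all singletons land in a single batch
    if max_cluster_size is None:
        return [singletons]
    return [singletons[i:i + max_cluster_size]
            for i in range(0, len(singletons), max_cluster_size)]

[len(b) for b in batch_singletons(list(range(150)), max_cluster_size=100)]
# → [100, 50]
```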
text_field Options:
- `"content"` (default): Uses TrackedItem.content or str(item)
- `"metadata.field_name"`: Extracts from item.metadata["field_name"]
- `": "`: Jinja2 template for custom text
Examples:
Cluster codes by name and description:
```yaml
- name: grouped_codes
  type: Cluster
  inputs: [coded_chunks]
  items_field: codes
  text_field: ": "
  method:
    name: hdbscan
    min_cluster_size_proportion: 0.25
    min_cluster_size: 5
    max_cluster_size: 30
```
Cluster text chunks directly:
```yaml
- name: grouped_chunks
  type: Cluster
  inputs: [chunks]
  items_field: null  # items are TrackedItems, not containers
  text_field: content
```
Large clusters for broad themes:
```yaml
- name: broad_groups
  type: Cluster
  inputs: [all_codes]
  items_field: codes
  method:
    name: hdbscan
    min_cluster_size_proportion: 0.25  # ~4 groups
    max_cluster_size: null  # no upper limit
```
Accessing Cluster Results:
Each output cluster is a TrackedItem with:
- `content`: Stringified cluster items (joined by `---`)
- `metadata.items`: Original items (Code objects, etc.)
- `metadata.cluster_id`: e.g., "cluster_0"
- `metadata.cluster_size`: Number of items
In downstream templates:
Export:
```
03_Cluster_grouped_codes/
├── cluster_summary.txt    # Statistics and per-cluster sizes
├── cluster_0_content.txt  # Stringified cluster content
├── cluster_1_content.txt
├── ...
├── outputs/
│   ├── cluster_0/
│   │   ├── 0000_code-slug.txt  # Individual items
│   │   └── 0001_code-slug.txt
│   ├── cluster_1/
│   │   └── ...
└── meta.txt
```
cluster_summary.txt contents:
```
Cluster Summary
===============
Total items: 345
Number of clusters: 18
Cluster size min: 5
Cluster size max: 50
Cluster size mean: 19.2

Method: hdbscan
  max_cluster_size: 50
  min_cluster_size: 10
  min_cluster_size_proportion: 0.25
  min_samples: 1
  effective_min_cluster_size: 86

Processing stats:
  Singletons (noise points): 12
  Oversized clusters split: 2

Per-cluster sizes:
  cluster_47: 50
  cluster_12: 48
  cluster_3: 32
  ...
```
Common Patterns:
Code → Cluster → Consolidate:
```yaml
nodes:
  - name: chunk_codes
    type: Map
    inputs: [chunks]

  - name: grouped_codes
    type: Cluster
    inputs: [chunk_codes]
    items_field: codes

  - name: themes
    type: Map
    inputs: [grouped_codes]
    # Each cluster becomes input for theme generation
```
Hierarchical clustering:
```yaml
nodes:
  # First pass: many small clusters
  - name: fine_clusters
    type: Cluster
    method:
      name: hdbscan
      min_cluster_size: 3
      max_cluster_size: 10

  # Consolidate each cluster
  - name: consolidated
    type: Map
    inputs: [fine_clusters]

  # Second pass: fewer large clusters
  - name: broad_clusters
    type: Cluster
    inputs: [consolidated]
    method:
      name: hdbscan
      min_cluster_size_proportion: 0.2
      max_cluster_size: null
```
Conditional Bypass for Small Datasets:
When processing small datasets, clustering and consolidation may be unnecessary overhead. Use if_skipped_bypass_to to bypass intermediate nodes when input is below the skip_below threshold:
```yaml
nodes:
  - name: codes_from_chunks
    type: Map
    inputs: [chunks]

  - name: grouped_codes
    type: Cluster
    inputs: [codes_from_chunks]
    items_field: codes
    skip_below: 20
    if_skipped_bypass_to: final_codes  # bypass consolidation

  - name: consolidated_codes
    type: Map
    inputs: [grouped_codes]

  - name: final_codes
    type: Reduce
    inputs: [consolidated_codes]
    items_field: codes
```
Behaviour:
When input has fewer than 20 codes:
- `grouped_codes` passes its input through unchanged (no clustering)
- `consolidated_codes` is automatically skipped
- `codes_from_chunks` output flows directly to `final_codes`
- No LLM calls for clustering or consolidation
When input has 20 or more codes:
- Normal clustering and consolidation flow executes
- All intermediate nodes run as configured
This is useful for pipelines that handle both small pilot studies and large datasets with the same configuration.
Batch
Group items into batches for processing.
Type: Batch
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
batch_size | int | 10 | Items per batch |
Input: List of items
Output: BatchList (list of lists)
Example:
```yaml
- name: batched_chunks
  type: Batch
  batch_size: 5
  inputs:
    - chunks
```
Used with Reduce to process batches:
```yaml
- name: batch_summaries
  type: Reduce
  inputs:
    - batched_chunks
---#batch_summaries
Summarize these chunks together:
```
GroupBy
Group items by one or more field values, creating nested batch structures.
Type: GroupBy
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes |
group_by | List[str] | Required | Field names to group by |
Input: List of items (TrackedItem or dict with metadata/outputs)
Output: BatchList (nested if multiple group_by fields)
How It Works:
- Groups items by values in specified fields
- Fields can be from metadata, outputs, or ChatterResult attributes
- Multiple fields create nested BatchLists
- Each batch contains items sharing same field values
Single Field Example:
```yaml
- name: by_category
  type: GroupBy
  group_by:
    - category
  inputs:
    - classified_items
```
If items have categories ["health", "tech", "health"], creates 2 batches:
- Batch 1: Items with category="health"
- Batch 2: Items with category="tech"
Multi-Field Example:
```yaml
- name: by_category_and_sentiment
  type: GroupBy
  group_by:
    - category
    - sentiment
  inputs:
    - classified_items
```
Creates nested structure:
- health → positive → [items]
- health → negative → [items]
- tech → positive → [items]
- tech → neutral → [items]
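Single-field grouping can be sketched in plain Python (hypothetical helper; multi-field grouping would apply the same step recursively within each batch):

```python
from collections import defaultdict

def group_by(items, field):
    # One batch per distinct field value, preserving first-seen order
    groups = defaultdict(list)
    for item in items:
        groups[item[field]].append(item)
    return list(groups.values())

items = [{"category": "health"}, {"category": "tech"}, {"category": "health"}]
len(group_by(items, "category"))  # → 2
```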
Use Cases:
- Process items differently based on classification
- Analyze patterns within categories
- Create hierarchical groupings
- Prepare for category-specific transformations
Ungroup
Flatten all BatchList nesting levels, returning a flat list.
Type: Ungroup
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name |
inputs | List[str] | Required | Input nodes (must be BatchList) |
Input: BatchList (any nesting level)
Output: Flat list of items
How It Works:
- Recursively flattens all batch nesting
- Preserves original item order
- Removes all grouping structure
Example:
```yaml
- name: flattened
  type: Ungroup
  inputs:
    - grouped_items
```
Converts nested structure:

```
[
    [item1, item2],      # Batch 1
    [item3],             # Batch 2
    [[item4], [item5]],  # Nested batches
]
```

To flat list:

```
[item1, item2, item3, item4, item5]
```
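The flattening rule can be sketched as follows (hypothetical helper):

```python
def flatten(batches):
    # Recursively flatten nested batch lists, preserving item order
    flat = []
    for b in batches:
        flat.extend(flatten(b) if isinstance(b, list) else [b])
    return flat

flatten([[1, 2], [3], [[4], [5]]])  # → [1, 2, 3, 4, 5]
```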
Use Cases:
- Remove grouping after category-specific processing
- Prepare batched results for non-batch-aware nodes
- Flatten before final output
- Combine results from multiple batch levels
Filter
Filter items based on a boolean expression; in simple mode this requires no LLM call.
Type: Filter
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | str | Required | Node name
inputs | List[str] | Required | Input nodes (must produce list)
expression | str | Required | Python expression using simpleeval
template | str | None | Optional template (enables LLM mode)
Input: List of items (TrackedItem or ChatterResult)
Output: Filtered list (items where expression is truthy)
Filter Modes:
- LLM Mode - Run template through LLM, filter on extracted fields
- Simple Mode - Filter directly on item data
Mode auto-detected: if template provided, uses LLM mode.
LLM Mode Example:
```yaml
- name: filtered
  type: Filter
  template: "Is this relevant? [[bool:is_relevant]]"
  expression: "is_relevant == True"
  inputs: [chunks]
```
Simple Mode Example:
```yaml
- name: long_chunks
  type: Filter
  expression: "len(input) > 100"
  inputs: [chunks]
```
Common Expressions:
```
# Boolean response (ChatterResult)
"item['decision_node'].response is True"

# Numeric threshold from outputs
"item['score_node'].outputs['score'] > 0.5"

# Multiple conditions
"item['category_node'].outputs['category'] == 'relevant' and item['score_node'].outputs['score'] > 0.3"
```
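For illustration only, Python's `eval` with a restricted namespace stands in for simpleeval here (the real node uses simpleeval precisely to avoid raw `eval`):

```python
def passes(item, expression):
    # Evaluate the filter expression with the item bound under both names
    # the docs use; only `len` is exposed as a callable
    namespace = {"input": item, "item": item, "len": len}
    return bool(eval(expression, {"__builtins__": {}}, namespace))

[c for c in ["short", "a much longer chunk of text"] if passes(c, "len(input) > 10")]
# keeps only the longer chunk
```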
Use Case:
Filtering items based on LLM decisions or computed scores without additional LLM calls.
Common Patterns
Parallel Processing
```yaml
nodes:
  - name: chunks
    type: Split

  - name: process_chunks
    type: Map  # Processes all chunks in parallel
    inputs: [chunks]
```
Collect and Consolidate
```yaml
nodes:
  - name: chunk_codes
    type: Map

  - name: all_codes
    type: Reduce  # Concatenate all outputs
    inputs: [chunk_codes]

  - name: final_codes
    type: Transform  # Consolidate into final result
    inputs: [all_codes]
```
Multi-input Transform
```yaml
nodes:
  - name: codes
    type: Transform

  - name: themes
    type: Transform
    inputs: [codes]  # Uses codes output

  - name: narrative
    type: Transform
    inputs:
      - codes  # Access both in template
      - themes
```
Classification Pipeline
```yaml
nodes:
  - name: chunks
    type: Split

  - name: classify
    type: Classifier
    inputs: [chunks]
```
Nested Splits
```yaml
nodes:
  - name: chapters
    type: Split
    chunk_size: 50000

  - name: paragraphs
    type: Split
    chunk_size: 5000
    inputs: [chapters]  # Split the splits
```
Provenance: book__chapters__0__paragraphs__2
Template Reference
Available Variables
In all nodes:
- Pipeline
default_contextvariables - All previous node results (by node name)
In ItemsNode (Map, Classifier, Transform):
- `` - Current item content
- `` - Provenance ID
- `` - Item metadata dict
- `` - Full TrackedItem object
In Reduce:
- `` - Each item being reduced
- Named node variables (e.g., ``)
Jinja2 Features
Conditionals:
Loops:
Filters:
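Standard Jinja2 syntax applies; the snippets below are a generic illustration with hypothetical variable names, not soak-specific:

```jinja
{% if metadata.category == "health" %} ... {% endif %}
{% for item in items %}{{ item }}{% endfor %}
{{ content | truncate(200) }}
```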
Struckdown Syntax
Return types for thematic analysis:
```
[[code*:codes]]    # List[Code]
[[theme*:themes]]  # List[Theme]
[[extract:text]]   # Free text
[[report]]         # Free text (narrative)
```

Return types for classification:

```
[[pick:field|a,b,c]]  # Single choice
[[pick*:field|a,b]]   # Multiple choice (list)
[[int:field]]         # Integer
[[boolean:field]]     # True/False
[[text:field]]        # Free text string
```

Context control:

```
¡BEGIN      # Start new context
¡OBLIVIATE  # Clear context between questions
```
Node Configuration
Global Config
Set in pipeline front matter:
```yaml
config:
  model_name: openai/gpt-4.1-mini
  llm_credentials:
    api_key: ${LLM_API_KEY}
    base_url: ${LLM_API_BASE}
```
Per-node Overrides
```yaml
- name: detailed_analysis
  type: Map
  model_name: openai/gpt-4o  # Override for this node
  max_tokens: 16000
  temperature: 0.3
  inputs: [chunks]
```
Next Steps
- Pipeline Format - YAML structure
- How-to: Thematic Analysis - Using nodes together
- Node Types Explanation - When to use which node