# Node Types

This document explains the different categories of nodes in soak and when to use each type.

## Node Categories

soak provides several node types that fall into distinct categories based on their role in data processing:
### 1. Input Processing Nodes

**Split** - Divide documents into smaller pieces

Use when:

- Documents are too large for LLM context windows
- You want to process text in manageable chunks
- You need granular analysis (sentence-level, paragraph-level)

```yaml
- name: chunks
  type: Split
  chunk_size: 30000
  split_unit: characters  # or: sentences, paragraphs
```
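To make character-based splitting concrete, here is a minimal sketch (illustrative only; `chunk_text` is a hypothetical helper, not soak's implementation):

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 70,000-character document becomes three chunks of 30000, 30000, and 10000 characters
chunks = chunk_text("a" * 70000, chunk_size=30000)
```

Sentence- and paragraph-level splitting follow the same idea, but break on unit boundaries instead of fixed character offsets.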
### 2. Transformation Nodes

**Map** - Apply an operation to each item independently, in parallel

Use when:

- Processing each item separately (no cross-item information needed)
- Running the same prompt on multiple chunks
- Maximum parallelization is desired

```yaml
- name: code_chunks
  type: Map
  inputs:
    - chunks
```
**Transform** - Apply an operation to a single aggregated input

Use when:

- Consolidating multiple results into one output
- Generating summaries or final reports
- Processing needs context from all inputs

```yaml
- name: final_codes
  type: Transform
  inputs:
    - all_codes
```
**TransformReduce** - Reduce then transform in one step

Use when:

- You need both reduction and transformation
- You want to avoid an intermediate node

```yaml
- name: consolidated
  type: TransformReduce
  inputs:
    - chunk_results
```
### 3. Aggregation Nodes

**Reduce** - Collect and concatenate results from multiple items

Use when:

- Gathering all outputs into a single text
- Preparing for a consolidation step
- Simple aggregation without LLM processing

```yaml
- name: all_codes
  type: Reduce
  inputs:
    - chunk_codes
```
### 4. Semantic Grouping Nodes

**Cluster** - Group items by semantic similarity

Use when:

- You have many codes/items and want to group similar ones
- Preparing for consolidation (e.g., merging similar codes into themes)
- Exploring natural groupings in qualitative data
- Reducing the number of items for downstream processing

For example: in a large corpus, coding each document may produce a large number of overlapping codes. To reduce duplication, we can cluster the codes into related groups and then ask an LLM to generate a new code or codes for each group (this avoids presenting the whole list of codes to the LLM at once).

```yaml
- name: grouped_codes
  type: Cluster
  inputs: [coded_documents]  # only one input is supported
  items_field: codes  # extract the `codes` field from each input
  method:
    name: hdbscan
    min_cluster_size_proportion: 0.25  # aim for ~4 clusters
    min_cluster_size: 5  # but at least 5 per cluster
    max_cluster_size: 50  # and no more than 50
```
How it works:

- Extracts text from each item, using `text_field` or the content of `items_field` (converted to text by default; the template can be customised)
- Computes embeddings for all items
- Runs HDBSCAN density-based clustering
- Splits oversized clusters and groups noise points
- Returns clusters as TrackedItems (each containing the grouped items)

**About HDBSCAN**

HDBSCAN is used here as a practical way to group items that look similar, without assuming that everything must fit neatly into a category. It groups items when there is enough shared structure between them and leaves items ungrouped when there is not. Users can set minimum and maximum cluster sizes (soak will recursively re-split oversized clusters into smaller ones if needed).
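The size constraint on re-splitting can be pictured with a small sketch. This bisection stand-in is not soak's method (which re-splits the actual clusters), but the recursive structure is the same idea: keep dividing any group that exceeds the maximum size.

```python
def resplit(cluster: list, max_size: int) -> list[list]:
    """Recursively split a cluster until no group exceeds max_size."""
    if len(cluster) <= max_size:
        return [cluster]
    mid = len(cluster) // 2
    return resplit(cluster[:mid], max_size) + resplit(cluster[mid:], max_size)

# An oversized cluster of 120 items ends up as four groups of 30
groups = resplit(list(range(120)), max_size=50)
```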
### 5. Structuring Nodes

**Batch** - Group items by criteria

Use when:

- Processing items as groups
- Organizing by document, category, or metadata
- Creating hierarchical structure

```yaml
- name: by_document
  type: Batch
  batch_by: doc_index
  inputs:
    - chunks
```
**GroupBy** - Group items by field values

Use when:

- Creating multiple groups from a single input
- Organizing by multi-field criteria
- Building nested batch structure

```yaml
- name: by_category
  type: GroupBy
  group_by:
    - category
    - subcategory
  inputs:
    - classified_items
```
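Conceptually, multi-field grouping works like Python's `itertools.groupby` over a composite key (an illustration of the idea, not soak's GroupBy implementation):

```python
from itertools import groupby

items = [
    {"category": "risk", "subcategory": "financial"},
    {"category": "risk", "subcategory": "health"},
    {"category": "benefit", "subcategory": "health"},
]

def key(item):
    return (item["category"], item["subcategory"])

# groupby requires sorted input; one group per distinct (category, subcategory) pair
groups = {k: list(g) for k, g in groupby(sorted(items, key=key), key=key)}
```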
**Ungroup** - Flatten all batch nesting

Use when:

- Converting a BatchList back to a flat list
- Removing all grouping structure

```yaml
- name: flattened
  type: Ungroup
  inputs:
    - grouped_items
```
### 6. Analysis Nodes

**Classifier** - Extract structured categorical data

Use when:

- Assigning categories or labels
- Extracting ratings or scores
- Running multi-model agreement analysis

```yaml
- name: classify
  type: Classifier
  model_names:
    - gpt-4o-mini
    - gpt-4o
  agreement_fields:
    - topic
  inputs:
    - documents
```
**VerifyQuotes** - Validate quotes against sources

Use when:

- Checking quote accuracy in qualitative analysis
- Ensuring the LLM used verbatim quotes
- Identifying paraphrasing or hallucinations

```yaml
- name: checkquotes
  type: VerifyQuotes
  inputs:
    - codes
```
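The core check is whether each quoted string appears verbatim in its source text. A minimal sketch (hypothetical helper; soak's VerifyQuotes may match more robustly, e.g. tolerating whitespace differences):

```python
def verify_quotes(quotes: list[str], source: str) -> dict[str, bool]:
    """Check whether each quote appears verbatim in the source text."""
    return {q: q in source for q in quotes}

source = "I felt much better after the second week of treatment."
results = verify_quotes(
    ["much better after the second week", "immediately better"], source
)
# the first quote is verbatim; the second is a paraphrase and fails the check
```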
### 7. Filtering Nodes

**Filter** - Keep or remove items based on conditions

Use when:

- Removing irrelevant items
- Selecting subsets for further processing
- Implementing quality checks

```yaml
- name: relevant_only
  type: Filter
  inputs:
    - classified
    - relevance_check
```
## Choosing the Right Node Type

### Question: How many inputs, how many outputs?

**Many inputs → Many outputs**: Use Map
- Example: Code each chunk independently

**Many inputs → One output**: Use Reduce or Transform
- Example: Collect all codes into a final codebook

**One input → Many outputs**: Use Split
- Example: Break a document into paragraphs

**One input → One output**: Use Transform
- Example: Generate a narrative report
### Question: Do I need an LLM?

**Yes**: Map, Transform, TransformReduce, and Classifier
- These nodes have templates and call LLMs

**No**: Split, Reduce, Cluster, Batch, GroupBy, Ungroup, Filter, and VerifyQuotes
- These nodes perform structural operations only (Cluster uses embeddings, not an LLM)
### Question: Do items need context from other items?

**No (independent)**: Use Map
- Faster due to parallelization
- Each item processed separately

**Yes (dependent)**: Use Transform
- All inputs combined before processing
- Slower, but enables cross-referencing
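The performance difference can be sketched with asyncio (illustrative only, not soak's internals): a Map-style node can issue its calls concurrently, while a Transform-style node makes one combined call over all inputs.

```python
import asyncio

async def process(item: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an LLM call
    return item.upper()

async def map_style(items: list[str]) -> list[str]:
    # Map: items are independent, so calls run concurrently
    return list(await asyncio.gather(*(process(i) for i in items)))

async def transform_style(items: list[str]) -> str:
    # Transform: all inputs are combined into a single call
    return await process(" ".join(items))

mapped = asyncio.run(map_style(["a", "b", "c"]))
combined = asyncio.run(transform_style(["a", "b", "c"]))
```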
### Question: Am I organizing or analyzing?

**Organizing data structure:**
- Split: Break apart
- Cluster: Group by semantic similarity
- Batch/GroupBy: Group by metadata/fields
- Ungroup: Flatten structure
- Filter: Remove items

**Analyzing content:**
- Map: Process items in parallel
- Transform: Consolidate and generate
- Classifier: Extract structured data
- VerifyQuotes: Validate quotes
## Common Node Patterns

### Pattern 1: Split-Map-Reduce-Transform

The classic qualitative analysis pattern:

```yaml
nodes:
  # Break documents into chunks
  - name: chunks
    type: Split
    chunk_size: 30000

  # Code each chunk independently
  - name: chunk_codes
    type: Map
    inputs:
      - chunks

  # Collect all codes
  - name: all_codes
    type: Reduce
    inputs:
      - chunk_codes

  # Consolidate into a final codebook
  - name: final_codes
    type: Transform
    inputs:
      - all_codes
```
### Pattern 2: Batch-Map-Reduce

Process documents separately, then combine:

```yaml
nodes:
  # Group chunks by document
  - name: by_document
    type: Batch
    batch_by: doc_index
    inputs:
      - chunks

  # Code within each document
  - name: document_codes
    type: Map
    inputs:
      - by_document

  # Aggregate across documents
  - name: all_codes
    type: Reduce
    inputs:
      - document_codes
```
### Pattern 3: Classify-GroupBy-Transform

Categorize, then process by category:

```yaml
nodes:
  # Classify items
  - name: classified
    type: Classifier
    inputs:
      - items

  # Group by classification
  - name: by_category
    type: GroupBy
    group_by:
      - category
    inputs:
      - classified

  # Analyze each category
  - name: category_analysis
    type: Map
    inputs:
      - by_category
```
### Pattern 4: Map-Filter-Transform

Generate candidates, filter, consolidate:

```yaml
nodes:
  # Generate relevance checks
  - name: relevance
    type: Map
    inputs:
      - chunks

  # Keep only relevant items
  - name: relevant_chunks
    type: Filter
    inputs:
      - chunks
      - relevance

  # Process filtered items
  - name: analysis
    type: Transform
    inputs:
      - relevant_chunks
```
## Node Execution Behavior

### Parallelization

**Parallel execution** (multiple items at once):
- Map
- Classifier (within each model)

**Sequential execution** (one item/batch at a time):
- Transform
- Reduce
- Split
- Filter
- VerifyQuotes

**Batch-level parallelization** (independent batches in parallel):
- All nodes respect DAG dependency batching
### Memory Considerations

**Low memory (streaming):**
- Reduce (concatenates text incrementally)
- Filter (drops items as processed)

**High memory (accumulates):**
- Transform (loads all inputs)
- Map (stores all results)
- Classifier (especially with multiple models)

**Controlled memory:**
- Split (processes one document at a time)
- Batch (groups items but doesn't duplicate them)
## Extending with Custom Nodes

All nodes inherit from base classes:

```python
from typing import Literal

from soak.models.nodes.base import CompletionDAGNode, ItemsNode


class MyCustomNode(ItemsNode, CompletionDAGNode):
    """Custom node with LLM completion."""

    type: Literal["MyCustomNode"] = "MyCustomNode"

    async def run(self):
        items = await self.get_items()
        results = items  # replace with custom processing logic
        return results
```
## Node Type Reference Table
| Node Type | Inputs | Outputs | Uses LLM | Parallelizes | Use Case |
|---|---|---|---|---|---|
| Split | 1 | Many | No | No | Break documents into chunks |
| Map | Many | Many | Yes | Yes | Process items independently |
| Reduce | Many | 1 | No | No | Collect results into text |
| Transform | 1+ | 1 | Yes | No | Consolidate and generate |
| TransformReduce | Many | 1 | Yes | No | Reduce + transform combined |
| Cluster | Many | Many | No | No | Group by semantic similarity |
| Batch | Many | Many | No | No | Group by metadata field |
| GroupBy | Many | Many | No | No | Group by multiple fields |
| Ungroup | Many | Many | No | No | Flatten batch structure |
| Classifier | Many | Many | Yes | Yes | Structured classification |
| Filter | Many | Many | No | No | Remove items by condition |
| VerifyQuotes | 1 | 1 | No | Yes | Validate quotes vs sources |
## Next Steps
- Node Reference - Detailed node parameters
- DAG Architecture - How nodes execute in pipeline
- Template System - How nodes use templates