Thematic Analysis with soak
This guide explains how soak performs inductive thematic analysis using the zs.soak pipeline.
Pipeline Overview
The zs pipeline implements a standard thematic analysis workflow:
- Split: Break documents into manageable chunks
- Map: Generate codes and themes from each chunk independently
- Reduce: Collect all codes and themes
- Transform: Consolidate into final codebook
- Transform: Generate final themes
- VerifyQuotes: Check quotes against source
- Transform: Write narrative report
Running the Pipeline
uv run soak zs data/interviews/*.txt --output results
The pipeline processes all files in parallel, then consolidates results.
Pipeline Stages
Stage 1: chunks (Split)
- name: chunks
type: Split
chunk_size: 30000
Splits documents into ~30,000 character chunks. This keeps LLM context manageable while preserving coherence.
Input: Raw documents
Output: List of text chunks with provenance tracking
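Conceptually, the splitter is a character-window chunker that records where each chunk came from. A minimal sketch of the idea (not soak's actual implementation; the `Chunk` fields are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    source: str  # originating document name (provenance)
    index: int   # position of this chunk within the document
    text: str

def split_document(name: str, text: str, chunk_size: int = 30000) -> List[Chunk]:
    """Split a document into fixed-size character windows, keeping provenance."""
    return [
        Chunk(source=name, index=i, text=text[start:start + chunk_size])
        for i, start in enumerate(range(0, len(text), chunk_size))
    ]
```

A real splitter would also try to break on paragraph or sentence boundaries to preserve coherence, rather than cutting mid-word.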
Stage 2: chunk_codes_and_themes (Map)
This is the core coding stage. Each chunk is analyzed independently.
- name: chunk_codes_and_themes
type: Map
max_tokens: 16000
inputs:
- chunks
Template (excerpt):
You will code the transcript independently, without using a pre-existing codebook.
Identify all relevant codes in the text, provide a Name for each code in 8 to 15 words.
Give a dense Description of the code in 50 words and direct quotes from the participant.
[[code*:codes]]
The [[code*:codes]] syntax uses struckdown to extract structured Code objects:
from typing import List

from pydantic import BaseModel

class Code(BaseModel):
    slug: str          # Short identifier (max 20 chars)
    name: str          # Descriptive name
    description: str   # Dense description of the code
    quotes: List[str]  # Example quotes
After coding, the same template generates themes:
Your task now is to group the initial codes into distinct themes.
Provide a descriptive and specific name of 8 to 15 words for each theme.
[[theme*:themes]]
Input: Each chunk
Output: List of ChatterResult objects, each containing codes and themes for one chunk
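Because every chunk is coded independently, the Map stage can fan work out in parallel. In outline (a sketch, with `code_chunk` standing in for the real LLM call):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def map_chunks(chunks: List[str], code_chunk: Callable[[str], Dict]) -> List[Dict]:
    """Apply the coding step to every chunk concurrently; results keep chunk order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(code_chunk, chunks))
```

The key property is that no chunk's result depends on any other chunk's, which is what makes both the parallelism and the later Reduce step straightforward.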
Stage 3: all_codes (Reduce)
- name: all_codes
type: Reduce
inputs:
- chunk_codes_and_themes
Template:
Collects the .codes field from all chunk results into a single text.
Input: All chunk_codes_and_themes results
Output: Concatenated text of all codes
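The Reduce step is plain aggregation with no LLM involved. Roughly (field access is illustrative):

```python
from typing import Dict, List

def reduce_field(results: List[Dict], field: str) -> str:
    """Concatenate one field from every chunk result into a single text block."""
    return "\n\n".join(str(r[field]) for r in results)
```

The same helper shape covers both `all_codes` (collecting `.codes`) and `all_themes` (collecting `.themes`).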
Stage 4: all_themes (Reduce)
- name: all_themes
type: Reduce
inputs:
- chunk_codes_and_themes
Template:
Collects the .themes field from all chunk results.
Input: All chunk_codes_and_themes results
Output: Concatenated text of all themes
Stage 5: codes (Transform)
Consolidates duplicate codes from different chunks into a single codebook.
- name: codes
type: Transform
max_tokens: 32000
inputs:
- all_codes
- all_themes
Template (excerpt):
We are now going to rationalise the set of codes identified across multiple documents.
## Preliminary codes
First, make a short list of notes on the codes we want to keep. Avoid duplicates.
[[codenotes]]
Now, form this list of new/aligned codes into the required format.
[[code*:codes]]
This two-step process (notes → structured output) helps the LLM deduplicate effectively.
Input: All codes and themes
Output: Final consolidated CodeList
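The LLM performs the actual consolidation, but the effect is comparable to deduplicating codes on a normalised key while pooling their evidence. A toy illustration of the goal (not how soak implements it):

```python
from typing import Dict, List

def dedupe_codes(codes: List[Dict]) -> List[Dict]:
    """Keep one code per normalised name, merging example quotes from duplicates."""
    merged: Dict[str, Dict] = {}
    for code in codes:
        key = code["name"].strip().lower()  # crude normalisation for illustration
        if key in merged:
            merged[key]["quotes"].extend(code["quotes"])
        else:
            merged[key] = {"name": code["name"], "quotes": list(code["quotes"])}
    return list(merged.values())
```

An LLM-based pass goes further than exact-name matching: it can merge codes that are semantically equivalent but worded differently, which is why the notes-first prompt matters.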
Stage 6: themes (Transform)
Generates final themes that reference the consolidated codes.
- name: themes
type: Transform
max_tokens: 32000
inputs:
- codes
- all_themes
Template (excerpt):
Review and consolidate the following preliminary themes into ~7 (+/- 2) overarching major themes.
Complete list of codes identified:
Preliminary themes:
[[theme*:themes]]
Each theme includes code_slugs that reference codes by their slug identifier.
Input: Final codes and preliminary themes
Output: Final Themes object
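Since themes reference codes only by slug, anything consuming the output resolves those references against the codebook. A hedged sketch of the shape involved (field names are assumptions based on the models shown above):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Theme:
    name: str
    code_slugs: List[str] = field(default_factory=list)

def resolve_codes(theme: Theme, codebook: Dict[str, str]) -> List[str]:
    """Look up each referenced code's name; skip slugs missing from the codebook."""
    return [codebook[s] for s in theme.code_slugs if s in codebook]
```

Keeping themes and codes joined by slug, rather than by copied text, means the consolidation step can rewrite code descriptions without breaking theme membership.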
Stage 7: checkquotes (VerifyQuotes)
Validates that quotes in codes actually appear in source documents.
- name: checkquotes
type: VerifyQuotes
inputs:
- codes
Checks each quote against original documents. Invalid quotes are flagged.
Input: Final codes
Output: Verification report
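Verbatim checking amounts to a normalised substring search over the source documents. A minimal sketch of that idea (the normalisation details here are assumptions, not soak's exact rules):

```python
import re
from typing import List

def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences pass."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quote(quote: str, documents: List[str]) -> bool:
    """A quote verifies if it appears verbatim (modulo whitespace/case) in any source."""
    q = normalise(quote)
    return any(q in normalise(doc) for doc in documents)
```

Any quote the LLM paraphrased, shortened with ellipses, or stitched across a chunk boundary will fail this kind of check, which is exactly what the flagging is for.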
Stage 8: narrative (Transform)
Writes a narrative report suitable for publication.
- name: narrative
type: Transform
inputs:
- themes
- codes
Template:
Results of a thematic analysis:
## CODES
## THEMES
Write this up as a standard qualitative report, ready for copying into
the results section of an academic journal.
Be brief - about 1 paragraph per theme. Include quotes for each theme.
[[report]]
Input: Final codes and themes
Output: Formatted narrative text
Understanding the Flow
Parallel Processing
Document 1 ──┐
Document 2 ──┼─→ Split ──→ Chunk 1 ──┐
Document 3 ──┘             Chunk 2 ──┼─→ Map (codes+themes for each)
                           Chunk 3 ──┘
                                      │
                                      ↓
                        all_codes, all_themes (Reduce)
                                      │
                                      ↓
                        Consolidate codes (Transform)
                                      │
                                      ↓
                        Generate final themes (Transform)
                                      │
                                      ↓
                        Verify quotes (VerifyQuotes)
                                      │
                                      ↓
                        Write narrative (Transform)
Why This Architecture?
Split first: Large documents exceed LLM context limits. Chunking enables processing long texts.
Map chunks independently: Parallel processing is faster. Each chunk gets full attention.
Reduce then consolidate: Collecting all codes first, then deduplicating, produces better results than streaming consolidation.
Two-stage themes: Initial themes from chunks provide diversity. Final consolidation ensures coherence.
Outputs
After running, check results/:
results/
├── 01_Split_chunks/
│ ├── inputs/ # Original documents
│ └── outputs/ # Generated chunks
├── 02_Map_chunk_codes_and_themes/
│ ├── inputs/ # Chunks that were coded
│ ├── 0000_*.json # Full ChatterResult for each chunk
│ └── ...
├── 03_Reduce_all_codes/
├── 04_Reduce_all_themes/
├── 05_Transform_codes/
│ └── result.json # Final CodeList
├── 06_Transform_themes/
│ └── result.json # Final Themes
├── 07_VerifyQuotes_checkquotes/
│ └── verification.txt
└── 08_Transform_narrative/
└── result.txt # Report text
The results.json and results.html files combine key outputs for easy viewing.
Customization
Adjust Chunk Size
Smaller chunks give more granular coding but a longer runtime:
- name: chunks
type: Split
chunk_size: 15000 # Half the default
Change Model
uv run soak zs data/*.txt --output results --model-name openai/gpt-4o
Modify Prompts
Copy soak/pipelines/zs.soak locally and edit templates:
cp soak/pipelines/zs.soak my_analysis.soak
# Edit my_analysis.soak
uv run soak my_analysis.soak data/*.txt --output results
See Customizing Your Analysis for details.
Add Context Variables
uv run soak zs data/*.txt \
--output results \
-c research_question="What are participants' experiences of recovery?"
The research_question variable is injected into the pipeline's templates.
Common Issues
Codes are too similar across chunks
The consolidation step should deduplicate, but if codes remain too granular, try:
- Larger chunk_size (fewer chunks = fewer duplicate codes)
- More specific instructions in the consolidation template
Quotes don’t verify
VerifyQuotes fails when:
- The LLM paraphrased instead of quoting verbatim
- Quotes span chunk boundaries
Solutions:
- Emphasize “exact quotes” in template
- Use larger chunks to reduce boundary issues
- Review 07_VerifyQuotes_checkquotes/verification.txt for details
Too many/few themes
Adjust the template instruction:
Review and consolidate into ~5 (+/- 1) overarching themes. # Fewer themes
Out of memory
- Process fewer files at once
- Reduce chunk_size
- Lower MAX_CONCURRENCY: export MAX_CONCURRENCY=5
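MAX_CONCURRENCY caps how many LLM calls are in flight at once, which bounds memory use. The mechanism is the standard semaphore pattern, sketched here (not soak's actual code; the environment variable name comes from the doc, the rest is illustrative):

```python
import asyncio
import os

MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "10"))

async def run_limited(coros):
    """Run coroutines, allowing at most MAX_CONCURRENCY to execute concurrently."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))
```

Lowering the cap trades throughput for a smaller peak of simultaneous requests and buffered responses.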
Next Steps
- Pre-extraction Workflow - Filter text before analysis
- Customizing Your Analysis - Adapt prompts
- Node Types - Understand Map/Reduce/Transform