What is soak?
soak is a Python package for LLM-assisted qualitative text analysis. It automates coding, theme generation, and other text-processing tasks while keeping the researcher in control of the process.
Purpose
Qualitative analysis—coding interviews, identifying themes, analyzing open-ended survey responses—is valuable but time-consuming. LLMs can assist, but using them effectively requires:
- Structured workflows: Breaking analysis into clear stages
- Parallel processing: Handling long documents by chunking
- Consolidation: Merging results from parallel processes
- Provenance: Tracking which text produced which codes
- Verification: Checking quotes against sources
soak provides this infrastructure. You write the analysis prompts; soak handles execution, data flow, and bookkeeping.
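For a sense of what the chunking infrastructure involves, here is a minimal sketch of paragraph-aware splitting (illustrative only; `split_into_chunks` is a hypothetical helper, not soak's actual implementation):

```python
# Illustrative sketch, NOT soak's API: split a long document into
# chunks of at most chunk_size characters, breaking on paragraph
# boundaries so each chunk can be coded by an LLM independently.
def split_into_chunks(text: str, chunk_size: int = 10000) -> list[str]:
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes an independent unit of work for the coding stage.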
What soak Does
Thematic Analysis
Process interview transcripts or other qualitative data:
uv run soak zs data/interviews/*.txt --output results
Produces:
- Codes with quotes from your data
- Themes grouping related codes
- Narrative report ready for publication
- Verified quotes with source tracking
Classification
Extract structured data from text:
uv run soak classifier --output results data/documents/*.docx
Produces:
- CSV with classifications for each document/chunk
- Summary statistics
- Source tracking showing which text was classified
Custom Pipelines
Define your own analysis workflow:
nodes:
  - name: summaries
    type: Map
    inputs: [documents]
  - name: comparison
    type: Transform
    inputs: [summaries]
What soak Doesn’t Do
Not a black box: You write the prompts. soak executes them but doesn’t impose an analysis method.
Not a replacement for expertise: LLMs assist; researchers interpret. soak helps you work faster, not think less.
Not a chatbot: soak runs predefined pipelines. It’s not interactive Q&A about your data.
When to Use soak
Use soak when:
- You have qualitative text data (interviews, surveys, documents)
- You want systematic, reproducible analysis
- You need to process more data than manual coding allows
- You want to try multiple analysis approaches (different prompts, models)
- You need source tracking and quote verification
Don’t use soak when:
- You have small datasets (< 5 documents) where manual coding is faster
- Your analysis requires deep contextual knowledge LLMs can’t provide
- You need real-time, interactive exploration (use a chatbot instead)
Design Philosophy
DAG-based Pipelines
soak uses directed acyclic graphs (DAGs) to represent analysis workflows. Each node is a processing step; edges represent data dependencies.
Benefits:
- Parallel execution: Independent steps run concurrently
- Reproducibility: Same pipeline + data = same results
- Modularity: Reuse nodes across pipelines
- Visibility: See exactly what happens at each stage
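The core idea can be sketched with the standard-library graphlib module: run each node once its inputs are ready (illustrative only; soak's executor also handles parallelism, caching, and rate limiting):

```python
# Minimal DAG execution sketch, NOT soak internals.
from graphlib import TopologicalSorter

def run_dag(nodes: dict, funcs: dict, initial: dict) -> dict:
    """nodes: name -> list of input-node names; funcs: name -> callable.
    Runs nodes in dependency order, threading results along edges."""
    results = dict(initial)
    for name in TopologicalSorter(nodes).static_order():
        if name in results:          # pre-supplied input, e.g. documents
            continue
        results[name] = funcs[name](*[results[d] for d in nodes[name]])
    return results

# Example: documents feed a summaries node, which feeds a comparison node.
out = run_dag(
    nodes={"documents": [], "summaries": ["documents"], "comparison": ["summaries"]},
    funcs={"summaries": lambda docs: [d.upper() for d in docs],
           "comparison": lambda s: " | ".join(s)},
    initial={"documents": ["doc one", "doc two"]},
)
```

Because the graph makes dependencies explicit, independent branches can be dispatched concurrently rather than sequentially.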
See DAG Architecture for details.
Template-driven
Analysis logic lives in templates, not code:
---#codes
Read this text and identify key themes:
Generate codes: [[code*:codes]]
Templates use Jinja2 for logic and struckdown for structured outputs. Non-programmers can write them.
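Since templates are plain Jinja2, rendering one stage looks roughly like this (the `input` variable name is hypothetical, not necessarily what soak passes; the struckdown placeholder is left untouched for the LLM stage):

```python
from jinja2 import Template

# "input" is an assumed variable name for the chunk content.
template = Template(
    "Read this text and identify key themes:\n"
    "{{ input }}\n"
    "Generate codes: [[code*:codes]]"  # struckdown placeholder, passed through verbatim
)
prompt = template.render(input="I felt supported by my team.")
```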
See Template System for details.
Provenance-first
Every piece of text tracks its origin:
interview_001.txt → chunks__0 → codes with source_id="chunks__0"
You can always trace results back to source text. Export formats include source tracking by default.
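The naming scheme above can be sketched with a simplified stand-in for soak's TrackedItem (illustrative; the real class carries more metadata):

```python
from dataclasses import dataclass

@dataclass
class TrackedItem:           # simplified stand-in, NOT soak's actual class
    source_id: str
    text: str

def chunk_with_provenance(item: TrackedItem, chunks: list[str]) -> list[TrackedItem]:
    """Derive each chunk's source_id from its parent document's id."""
    return [TrackedItem(f"{item.source_id}__chunks__{i}", c)
            for i, c in enumerate(chunks)]

doc = TrackedItem("interview_001", "first part\n\nsecond part")
pieces = chunk_with_provenance(doc, ["first part", "second part"])
```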
See Provenance Tracking for details.
Common Workflows
Inductive Thematic Analysis
- Split documents into chunks
- Code each chunk independently (Map)
- Collect all codes (Reduce)
- Consolidate duplicates (Transform)
- Generate themes (Transform)
- Verify quotes (VerifyQuotes)
- Write narrative (Transform)
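The quote-verification step above can be as simple as a whitespace-normalised verbatim check (a minimal sketch; soak's VerifyQuotes node may use a more tolerant match):

```python
# Illustrative quote check, NOT soak's implementation: a quote passes
# if it appears verbatim in its source chunk after collapsing whitespace.
def verify_quote(quote: str, source_text: str) -> bool:
    norm = lambda s: " ".join(s.split())
    return norm(quote) in norm(source_text)
```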
Classification with Agreement
- Split documents if needed
- Classify with multiple models (Classifier with model_names)
- Calculate inter-rater agreement
- Export results with source tracking
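Treating each model as a rater, agreement can be measured with Cohen's kappa, for example (computed by hand here; this is a sketch, not necessarily the statistic soak reports):

```python
# Cohen's kappa between two raters: observed agreement corrected
# for the agreement expected by chance from each rater's label mix.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)
```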
Pre-extraction Analysis
- Extract relevant sections from long documents (Map)
- Code the extracted sections (Map)
- Continue as in standard thematic analysis
Architecture Overview
User writes:            soak provides:
─────────────           ──────────────
Pipeline YAML      ───→ DAG executor
Templates          ───→ Jinja2 + struckdown rendering
Research Qs        ───→ Context injection
                        Document loading (PDF, DOCX, TXT)
                        Parallel processing
                        Rate limiting
                        Caching
                        Export (JSON, HTML, CSV)
                        Provenance tracking
Example: From Pipeline to Results
Pipeline (my_analysis.soak):
nodes:
  - name: chunks
    type: Split
    chunk_size: 10000
  - name: codes
    type: Map
    inputs: [chunks]
---#codes
Generate codes from this text:
[[code*:codes]]
Command:
uv run soak my_analysis.soak data/interview.txt --output results
What happens:
- Load interview.txt and create a TrackedItem with source_id="interview"
- Split into chunks: "interview__chunks__0", "interview__chunks__1", …
- Render the template for each chunk with `` = chunk content
- Send prompts to the LLM in parallel (respecting rate limits)
- Parse responses into Code objects via struckdown
- Export results with source tracking
- Generate an HTML view of all codes
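The parallel-send step can be sketched with an asyncio semaphore capping concurrency (illustrative only; `call_llm` is a stand-in, not soak's API, and real rate limiting is more involved):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # stand-in for a real LLM call
    await asyncio.sleep(0)
    return f"codes for: {prompt[:20]}"

async def run_all(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)   # cap in-flight requests
    async def one(p: str) -> str:
        async with sem:
            return await call_llm(p)
    return await asyncio.gather(*(one(p) for p in prompts))

responses = asyncio.run(run_all(["chunk 0 text", "chunk 1 text"]))
```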
Result:
results/
├── 01_Split_chunks/
│   ├── inputs/0000_interview.txt
│   └── outputs/0000_interview__chunks__0.txt
├── 02_Map_codes/
│   ├── inputs/0000_interview__chunks__0.txt
│   └── 0000_interview__chunks__0_response.json
└── results.html
Next Steps
- Getting Started - Run your first analysis
- Thematic Analysis - Detailed workflow
- Node Types - Understanding processing nodes
- DAG Architecture - How pipelines execute