What is soak?
soak is a Python package for LLM-assisted qualitative text analysis. It automates coding, theme generation, and other text-processing tasks while keeping the researcher in control of the process.
Purpose
Qualitative analysis—coding interviews, identifying themes, analyzing open-ended survey responses—is valuable but time-consuming. LLMs can assist, but using them effectively requires:
- Structured workflows: Breaking analysis into clear stages
- Parallel processing: Handling long documents by chunking
- Consolidation: Merging results from parallel processes
- Provenance: Tracking which text produced which codes
- Verification: Checking quotes against sources
soak provides this infrastructure. You write the analysis prompts; soak handles execution, data flow, and bookkeeping.
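For a sense of what the chunking infrastructure involves, here is a minimal sketch of paragraph-aware splitting (illustrative only; `split_into_chunks` is a hypothetical helper, not soak's actual implementation):

```python
# Illustrative sketch, NOT soak's API: split a long document into
# chunks of at most chunk_size characters, breaking on paragraph
# boundaries so each chunk can be coded by an LLM independently.
def split_into_chunks(text: str, chunk_size: int = 10000) -> list[str]:
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes an independent unit of work for the coding stage.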
What soak Does
Thematic Analysis
Process interview transcripts or other qualitative data:
uv run soak zs data/interviews/*.txt --output results
Produces:
- Codes with quotes from your data
- Themes grouping related codes
- Narrative report ready for publication
- Verified quotes with source tracking
Classification
Extract structured data from text:
uv run soak classifier --output results data/documents/*.docx
Produces:
- CSV with classifications for each document/chunk
- Summary statistics
- Source tracking showing which text was classified
Custom Pipelines
Define your own analysis workflow:
nodes:
  - name: summaries
    type: Map
    inputs: [documents]
  - name: comparison
    type: Transform
    inputs: [summaries]
What soak Doesn’t Do
Not a black box: You write the prompts. soak executes them but doesn’t impose an analysis method.
Not a replacement for expertise: LLMs assist; researchers interpret. soak helps you work faster, not think less.
Not a chatbot: soak runs predefined pipelines. It’s not interactive Q&A about your data.
When to Use soak
Use soak when:
- You have qualitative text data (interviews, surveys, documents)
- You want systematic, reproducible analysis
- You need to process more data than manual coding allows
- You want to try multiple analysis approaches (different prompts, models)
- You need source tracking and quote verification
Don’t use soak when:
- You have small datasets (< 5 documents) where manual coding is faster
- Your analysis requires deep contextual knowledge LLMs can’t provide
- You need real-time, interactive exploration (use a chatbot instead)
Design Philosophy
DAG-based Pipelines
soak uses directed acyclic graphs (DAGs) to represent analysis workflows. Each node is a processing step; edges represent data dependencies.
Benefits:
- Parallel execution: Independent steps run concurrently
- Reproducibility: Same pipeline + data = same results
- Modularity: Reuse nodes across pipelines
- Visibility: See exactly what happens at each stage
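The core idea can be sketched with the standard-library graphlib module: run each node once its inputs are ready (illustrative only; soak's executor also handles parallelism, caching, and rate limiting):

```python
# Minimal DAG execution sketch, NOT soak internals.
from graphlib import TopologicalSorter

def run_dag(nodes: dict, funcs: dict, initial: dict) -> dict:
    """nodes: name -> list of input-node names; funcs: name -> callable.
    Runs nodes in dependency order, threading results along edges."""
    results = dict(initial)
    for name in TopologicalSorter(nodes).static_order():
        if name in results:          # pre-supplied input, e.g. documents
            continue
        results[name] = funcs[name](*[results[d] for d in nodes[name]])
    return results

# Example: documents feed a summaries node, which feeds a comparison node.
out = run_dag(
    nodes={"documents": [], "summaries": ["documents"], "comparison": ["summaries"]},
    funcs={"summaries": lambda docs: [d.upper() for d in docs],
           "comparison": lambda s: " | ".join(s)},
    initial={"documents": ["doc one", "doc two"]},
)
```

Because the graph makes dependencies explicit, independent branches can be dispatched concurrently rather than sequentially.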
See DAG Architecture for details.
Template-driven
Analysis logic lives in templates, not code:
---#codes
Read this text and identify key themes:
Generate codes: [[code*:codes]]
Templates use Jinja2 for logic and struckdown for structured outputs. Non-programmers can write them.
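Since templates are plain Jinja2, rendering one stage looks roughly like this (the `input` variable name is hypothetical, not necessarily what soak passes; the struckdown placeholder is left untouched for the LLM stage):

```python
from jinja2 import Template

# "input" is an assumed variable name for the chunk content.
template = Template(
    "Read this text and identify key themes:\n"
    "{{ input }}\n"
    "Generate codes: [[code*:codes]]"  # struckdown placeholder, passed through verbatim
)
prompt = template.render(input="I felt supported by my team.")
```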
See Template System for details.
Provenance-first
Every piece of text tracks its origin:
interview_001.txt → chunks__0 → codes with source_id="chunks__0"
You can always trace results back to source text. Export formats include source tracking by default.
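The naming scheme above can be sketched with a simplified stand-in for soak's TrackedItem (illustrative; the real class carries more metadata):

```python
from dataclasses import dataclass

@dataclass
class TrackedItem:           # simplified stand-in, NOT soak's actual class
    source_id: str
    text: str

def chunk_with_provenance(item: TrackedItem, chunks: list[str]) -> list[TrackedItem]:
    """Derive each chunk's source_id from its parent document's id."""
    return [TrackedItem(f"{item.source_id}__chunks__{i}", c)
            for i, c in enumerate(chunks)]

doc = TrackedItem("interview_001", "first part\n\nsecond part")
pieces = chunk_with_provenance(doc, ["first part", "second part"])
```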
See Provenance Tracking for details.
Common Workflows
Inductive Thematic Analysis
- Split documents into chunks
- Code each chunk independently (Map)
- Collect all codes (Reduce)
- Consolidate duplicates (Transform)
- Generate themes (Transform)
- Verify quotes (VerifyQuotes)
- Write narrative (Transform)
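The quote-verification step above can be as simple as a whitespace-normalised verbatim check (a minimal sketch; soak's VerifyQuotes node may use a more tolerant match):

```python
# Illustrative quote check, NOT soak's implementation: a quote passes
# if it appears verbatim in its source chunk after collapsing whitespace.
def verify_quote(quote: str, source_text: str) -> bool:
    norm = lambda s: " ".join(s.split())
    return norm(quote) in norm(source_text)
```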
Classification with Agreement
- Split documents if needed
- Classify with multiple models (Classifier with model_names)
- Calculate inter-rater agreement
- Export results with source tracking
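Treating each model as a rater, agreement can be measured with Cohen's kappa, for example (computed by hand here; this is a sketch, not necessarily the statistic soak reports):

```python
# Cohen's kappa between two raters: observed agreement corrected
# for the agreement expected by chance from each rater's label mix.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)
```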
Pre-extraction Analysis
- Extract relevant sections from long documents (Map)
- Code the extracted sections (Map)
- Continue as in standard thematic analysis
Architecture Overview
User writes:            soak provides:
─────────────           ──────────────
Pipeline YAML      ───→ DAG executor
Templates          ───→ Jinja2 + struckdown rendering
Research Qs        ───→ Context injection
                        Document loading (PDF, DOCX, TXT)
                        Parallel processing
                        Rate limiting
                        Caching
                        Export (JSON, HTML, CSV)
                        Provenance tracking
Example: From Pipeline to Results
Pipeline (my_analysis.soak):
nodes:
  - name: chunks
    type: Split
    chunk_size: 10000
  - name: codes
    type: Map
    inputs: [chunks]
---#codes
Generate codes from this text:
[[code*:codes]]
Command:
uv run soak my_analysis.soak data/interview.txt --output results
What happens:
- Load interview.txt and create a TrackedItem with source_id="interview"
- Split into chunks: "interview__chunks__0", "interview__chunks__1", …
- Render the template for each chunk with `` = chunk content
- Send prompts to the LLM in parallel (respecting rate limits)
- Parse responses into Code objects via struckdown
- Export results with source tracking
- Generate an HTML view of all codes
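The parallel-send step can be sketched with an asyncio semaphore capping concurrency (illustrative only; `call_llm` is a stand-in, not soak's API, and real rate limiting is more involved):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # stand-in for a real LLM call
    await asyncio.sleep(0)
    return f"codes for: {prompt[:20]}"

async def run_all(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)   # cap in-flight requests
    async def one(p: str) -> str:
        async with sem:
            return await call_llm(p)
    return await asyncio.gather(*(one(p) for p in prompts))

responses = asyncio.run(run_all(["chunk 0 text", "chunk 1 text"]))
```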
Result:
results/
├── 01_Split_chunks/
│   ├── inputs/0000_interview.txt
│   └── outputs/0000_interview__chunks__0.txt
├── 02_Map_codes/
│   ├── inputs/0000_interview__chunks__0.txt
│   └── 0000_interview__chunks__0_response.json
└── results.html
Next Steps
- Getting Started - Run your first analysis
- Thematic Analysis - Detailed workflow
- Node Types - Understanding processing nodes
- DAG Architecture - How pipelines execute