# Getting Started with soak

This tutorial walks you through installing soak and running your first thematic analysis.
## Prerequisites

- Python 3.12+
- The uv package manager
- An API key for an OpenAI-compatible LLM provider
## Installation

```bash
uv tool install soaking
```

Run the test command to check your setup and set the required environment variables:

```bash
soak test
```

This saves the variables to your .env file:

```
LLM_API_KEY=your_api_key
LLM_API_BASE=https://api.openai.com/v1  # Optional
```
## Your First Analysis

### 1. Prepare Your Data

Create a text file with some interview data. For this example, create `data/interview.txt`:
```text
I started feeling ill about two years ago. At first it was just fatigue,
but then I couldn't get out of bed. The doctors didn't know what was wrong.
Eventually I was diagnosed with CFS. It was a relief to have a name for it,
but also scary because there's no cure. I've had to completely change my life.
The hardest part is that people don't understand. They think I'm just tired.
But this is different - it's like my body just stopped working properly.
```
### 2. Run the Analysis

Use the built-in `zs` (zero-shot) pipeline for thematic analysis, based on Raza et al. (2025):

```bash
uv run soak zs data/interview.txt --output my_first_analysis -f
```
You can also run the analysis from a Python script. Here we run several versions of the same analysis, each guided by a different context:

```python
#!/usr/bin/env python3
import json
from pathlib import Path

from soak import api

cfg = json.loads(Path("perspectives.json").read_text())

for i, (name, block) in enumerate(cfg["perspectives"].items(), 1):
    result = api.run(
        "zs",
        "soak-data/UKDA-2000-tab/rtf/2000int00*.rtf",
        context=block["context"],
        output=f"history_{name}",
        seed=i,
        skip_nodes=["checkquotes"],
        force=True,
        progress=True,
    )
    print(f"{name}: {len(result.themes)} themes")
```
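The script reads a `perspectives.json` file. A minimal example of the shape it assumes — the perspective names and context strings here are placeholders for illustration, not part of soak:

```json
{
  "perspectives": {
    "clinical": {
      "context": "Read the interviews as a clinician: focus on symptoms, diagnosis, and treatment."
    },
    "social": {
      "context": "Read the interviews sociologically: focus on stigma, relationships, and identity."
    }
  }
}
```

Each entry becomes one run: its `context` is passed to the pipeline and its name appears in the output filename (e.g. `history_clinical`).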
This will:

- Split the text into chunks
- Generate codes from each chunk
- Identify themes
- Consolidate codes and themes
- Verify quotes
- Write a narrative report

The process takes a few minutes, depending on the length of the interview text.
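Chunking is handled internally by the pipeline, but as a rough illustration of what "split the text into chunks" means, here is a generic overlapping word-window chunker — a sketch only, not soak's actual implementation:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words words.

    Generic illustration only; soak's own chunking strategy may differ.
    Overlap between consecutive chunks helps avoid cutting a thought in half.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 450-word text yields three overlapping chunks of at most 200 words each.
print(len(chunk_text("word " * 450)))  # 3
```

Codes are then generated per chunk and consolidated afterwards, which is why short overlapping chunks matter: each LLM call only ever sees a manageable slice of the interview.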
### 3. View Results

Open the HTML output:

```bash
open my_first_analysis.html
```

You’ll see:

- Codes: Specific concepts identified in the text (with quotes)
- Themes: Broader patterns grouping codes
- Narrative: A written report of findings

A JSON file containing all the model output and a log of every LLM call is also written:

```bash
cat my_first_analysis.json | jq '.codes'
```
## Understanding the Output

### Codes

Each code has:

- `slug`: Short identifier (e.g., `illness_onset`)
- `name`: Descriptive name (e.g., “Gradual onset of unexplained symptoms”)
- `description`: What the code represents
- `quotes`: Example text from your data
Example:

```json
{
  "slug": "social_misunderstanding",
  "name": "Others fail to grasp the severity of the condition",
  "description": "Participants describe frustration when family, friends...",
  "quotes": [
    "people don't understand. They think I'm just tired."
  ]
}
```
### Themes

Themes group related codes:

```json
{
  "name": "Living with chronic illness uncertainty",
  "description": "Participants navigate the challenges of...",
  "code_slugs": ["illness_onset", "diagnosis_relief", "lifestyle_changes"]
}
```
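Themes reference codes only by slug, so joining the two lists back together is up to you. A minimal sketch using the structures shown above — the field names mirror this tutorial's JSON examples, and the extra code names are made up for illustration:

```python
# Resolve a theme's code_slugs into full code objects.
# Field names follow the JSON examples in this tutorial; the code
# names below are illustrative placeholders, not real output.
codes = [
    {"slug": "illness_onset", "name": "Gradual onset of unexplained symptoms"},
    {"slug": "diagnosis_relief", "name": "Relief at finally having a diagnosis"},
    {"slug": "lifestyle_changes", "name": "Major changes to everyday life"},
]
theme = {
    "name": "Living with chronic illness uncertainty",
    "code_slugs": ["illness_onset", "diagnosis_relief", "lifestyle_changes"],
}

by_slug = {c["slug"]: c for c in codes}
resolved = [by_slug[s] for s in theme["code_slugs"] if s in by_slug]

for code in resolved:
    print(f"{theme['name']}: {code['name']}")
```

The same pattern works on the real JSON file: load it, index `codes` by slug, and walk each theme's `code_slugs`.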
### Narrative

A written report (sort of) ready for your results section:

> Living with chronic illness uncertainty: Participants described a gradual onset of symptoms that were initially unexplained. As one participant noted, “At first it was just fatigue, but then I couldn’t get out of bed”…
## Next Steps

- Customize the analysis: See Customizing Your Analysis
- Understand the pipeline: See Thematic Analysis How-to
- Work with multiple files: `soak zs data/*.txt --output results`
- Try classification: See Build a Classifier