Working with Results
This tutorial shows how to analyze and interpret soak output data.
Output Formats
After running a pipeline, you get two files:
uv run soak zs data/*.txt --output results
# Creates:
results.json # Full pipeline data
results.html # Rendered view
JSON Structure
{
"name": "zero_shot",
"config": { ... },
"nodes": [
{
"name": "chunks",
"type": "Split",
"result": [ ... ]
},
{
"name": "codes",
"type": "Transform",
"result": {
"codes": [ ... ]
}
},
...
]
}
Each node’s result is stored. Access via:
# Get final codes
cat results.json | jq '.nodes[] | select(.name=="codes") | .result.codes'
# Get narrative
cat results.json | jq '.nodes[] | select(.name=="narrative") | .result.report'
# Count themes
cat results.json | jq '.nodes[] | select(.name=="themes") | .result.themes | length'
HTML View
Open in browser:
open results.html
Shows codes, themes, and narrative in readable format.
Understanding Codes
A Code object:
{
"slug": "medical_dismissal",
"name": "Experiences of being dismissed by healthcare providers",
"description": "Participants describe frustration when doctors minimize...",
"quotes": [
"The doctor said it was all in my head",
"They told me to just exercise more..."
]
}
Fields:
slug- Short identifier (max 20 chars, a-z only)name- Descriptive name (8-15 words)description- What the code represents (~50 words)quotes- Example text from your data
Extracting Codes
All code names:
cat results.json | jq '.nodes[] | select(.name=="codes") | .result.codes[].name'
Codes with quotes:
cat results.json | jq '.nodes[] | select(.name=="codes") | .result.codes[] | {name, quotes}'
Find specific code:
cat results.json | jq '.nodes[] | select(.name=="codes") | .result.codes[] | select(.slug=="medical_dismissal")'
Working with Codes in Python
import json
with open("results.json") as f:
data = json.load(f)
# Find codes node
codes_node = next(n for n in data["nodes"] if n["name"] == "codes")
codes = codes_node["result"]["codes"]
# Print all code names
for code in codes:
print(f"- {code['name']}")
# Codes by quote count
sorted_codes = sorted(codes, key=lambda c: len(c["quotes"]), reverse=True)
print(f"Most quoted: {sorted_codes[0]['name']} ({len(sorted_codes[0]['quotes'])} quotes)")
# Export to CSV
import csv
with open("codes.csv", "w") as f:
writer = csv.DictWriter(f, fieldnames=["slug", "name", "description", "quote_count"])
writer.writeheader()
for code in codes:
writer.writerow({
"slug": code["slug"],
"name": code["name"],
"description": code["description"],
"quote_count": len(code["quotes"])
})
Understanding Themes
A Theme object:
{
"name": "Navigating medical system barriers",
"description": "Participants struggle to access appropriate care due to...",
"code_slugs": ["medical_dismissal", "diagnostic_delay", "treatment_access"]
}
Fields:
name- Theme name (8-15 words)description- What the theme represents (60-80 words)code_slugs- References to codes (by slug)
Extracting Themes
All theme names:
cat results.json | jq '.nodes[] | select(.name=="themes") | .result.themes[].name'
Themes with codes:
cat results.json | jq '.nodes[] | select(.name=="themes") | .result.themes[] | {name, code_slugs}'
Linking Themes to Codes
import json
with open("results.json") as f:
data = json.load(f)
codes_node = next(n for n in data["nodes"] if n["name"] == "codes")
themes_node = next(n for n in data["nodes"] if n["name"] == "themes")
codes = {c["slug"]: c for c in codes_node["result"]["codes"]}
themes = themes_node["result"]["themes"]
# Print themes with their codes
for theme in themes:
print(f"\n{theme['name']}")
print(f" {theme['description']}")
print(f" Codes:")
for slug in theme["code_slugs"]:
code = codes[slug]
print(f" - {code['name']}")
print(f" {code['quotes'][0][:100]}...") # First quote preview
Understanding the Narrative
The narrative is formatted text ready for publication:
cat results.json | jq -r '.nodes[] | select(.name=="narrative") | .result.report'
Output:
**Theme 1: Living with uncertainty**: Participants described prolonged periods
without diagnosis, leading to anxiety... "I didn't know what was wrong with me
for three years."
**Theme 2: Medical system barriers**: Access to appropriate care was
challenging...
Copy directly into your results section.
Detailed Execution Dump
When you run a pipeline, a detailed execution dump is automatically created:
uv run soak zs data/*.txt --output results
This creates both output files and a results_dump/ folder:
results_dump/
├── 01_Split_chunks/
│ ├── inputs/
│ │ ├── 0000_interview_001.txt
│ │ └── 0000_interview_001_metadata.json
│ ├── outputs/
│ │ ├── 0000_interview_001__chunks__0.txt
│ │ └── 0000_interview_001__chunks__0_metadata.json
│ └── split_summary.txt
├── 02_Map_chunk_codes_and_themes/
│ ├── inputs/
│ ├── 0000_interview_001__chunks__0_prompt.md
│ ├── 0000_interview_001__chunks__0_response.json
│ └── ...
└── metadata.json
Inspecting Node Outputs
View a specific chunk:
cat results_dump/01_Split_chunks/outputs/0000_interview_001__chunks__0.txt
See LLM prompt:
cat results_dump/02_Map_chunk_codes_and_themes/0000_*_prompt.md
See LLM response:
cat results_dump/02_Map_chunk_codes_and_themes/0000_*_response.json | jq
Tracking Provenance
Each file includes source_id in filename:
0000_interview_001__chunks__0.txt
└─────┬─────┘ └──┬──┘ └┬┘
document node chunk
Trace a code’s quote back to source:
# Find code
code = codes[0]
quote = code["quotes"][0]
# Search in chunk outputs
import os
for file in os.listdir("results_dump/01_Split_chunks/outputs/"):
if file.endswith(".txt"):
with open(f"results_dump/01_Split_chunks/outputs/{file}") as f:
if quote in f.read():
print(f"Found in: {file}")
Analyzing Classifications
For classifier pipelines, outputs include CSV:
uv run soak classifier data/*.txt --output results
Check results_dump/XX_Classifier_*/classifications.csv:
index,source_id,doc_index,original_file,topic,sentiment,positivity
0,interview_001__sentences__0,0,data/interview_001.txt,health,negative,2
1,interview_001__sentences__1,0,data/interview_001.txt,health,neutral,3
2,interview_002__sentences__0,1,data/interview_002.txt,tech,positive,4
Analyzing Classifications in Python
import pandas as pd
df = pd.read_csv("results_dump/XX_Classifier_*/classifications.csv")
# Distribution
print(df['topic'].value_counts())
print(df['sentiment'].value_counts())
# Cross-tabulation
print(pd.crosstab(df['topic'], df['sentiment']))
# By document
by_doc = df.groupby('original_file').agg({
'topic': lambda x: x.mode()[0], # Most common topic
'sentiment': lambda x: x.mode()[0],
'positivity': 'mean'
})
print(by_doc)
# Find specific classifications
health_negative = df[(df['topic'] == 'health') & (df['sentiment'] == 'negative')]
print(health_negative['source_id'])
Quote Verification
If your pipeline includes VerifyQuotes:
cat results_dump/XX_VerifyQuotes_*/verification.txt
Shows which quotes failed verification:
Verifying 45 quotes...
✓ 42 quotes verified
✗ 3 quotes failed:
Code: medical_dismissal
Quote: "The doctor said it was in my head"
Reason: Not found in source documents (possible paraphrase)
Fix by:
- Checking original quote in LLM response
- Emphasizing “verbatim quotes” in template
- Reviewing chunks for quote boundaries
Comparing Multiple Runs
Run same pipeline multiple times with different parameters:
uv run soak zs data/*.txt -o run1
uv run soak zs data/*.txt -o run2 --model-name openai/gpt-4o
uv run soak zs data/*.txt -o run3 -c persona="Clinical psychologist"
Compare:
uv run soak compare run1.json run2.json run3.json -o comparison.html
open comparison.html
Shows:
- Theme similarity heatmaps
- Network plots of overlapping themes
- Agreement statistics
Exporting for Analysis
Convert to DataFrame
import json
import pandas as pd
with open("results.json") as f:
data = json.load(f)
codes_node = next(n for n in data["nodes"] if n["name"] == "codes")
codes = codes_node["result"]["codes"]
# Flatten codes
rows = []
for code in codes:
for quote in code["quotes"]:
rows.append({
"code_slug": code["slug"],
"code_name": code["name"],
"quote": quote
})
df = pd.DataFrame(rows)
df.to_csv("codes_with_quotes.csv", index=False)
Import to NVivo/Atlas.ti
Export codes as CSV for import:
import csv
with open("codes_for_nvivo.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["Code", "Description", "Example"])
for code in codes:
writer.writerow([
code["name"],
code["description"],
"; ".join(code["quotes"][:3]) # First 3 quotes
])
Tips
Find longest codes:
cat results.json | jq '.nodes[] | select(.name=="codes") | .result.codes | sort_by(.description | length) | reverse | .[0:3] | .[].name'
Count total quotes:
cat results.json | jq '[.nodes[] | select(.name=="codes") | .result.codes[].quotes[]] | length'
Extract just narrative:
cat results.json | jq -r '.nodes[] | select(.name=="narrative") | .result.report' > narrative.md
Inspect specific node:
# What did the 'all_codes' Reduce produce?
cat results.json | jq '.nodes[] | select(.name=="all_codes") | .result'
Next Steps
- Thematic Analysis - Understanding the pipeline
- Customizing Your Analysis - Adapting prompts
- Node Reference - All node types and outputs