CLI Reference

soak provides a command-line interface for running pipelines and working with results.

Global Options

These options apply to all commands:

Option                  Short  Description
--verbose               -v     Increase verbosity (-v = INFO, -vv = DEBUG)
--install-completion           Install shell completion for the current shell
--show-completion              Show completion script for the current shell
--help                         Show help message and exit

Commands

Command   Description
run       Run a pipeline on input files
compare   Compare analyses or string lists and generate comparison statistics
show      Show the contents of a built-in pipeline or template
coverage  Analyse how well themes from an analysis are represented across documents

run

Run a pipeline on input files.

uv run soak PIPELINE INPUT_FILES [OPTIONS]

Arguments:

  • PIPELINE - Pipeline name (e.g., zs, classifier) or path to YAML file
  • INPUT_FILES - One or more file paths or glob patterns (e.g., data/*.txt, interviews.zip)

Options:

Option                    Short  Description
--model MODEL             -m     Model configuration (see Model Aliases below)
--output PATH             -o     Output file path without extension (default: derived from pipeline name)
--template NAME           -t     Template name in soak/templates/ or path to custom template (default: pipeline.html; can be used multiple times)
--include-documents              Include original document text in JSON output
--context KEY=VALUE       -c     Override context variables (can be used multiple times)
--force                   -f     Overwrite existing output files/folders (checked before pipeline runs)
--sample N                -S     Randomly sample N rows/documents from input (mutually exclusive with --head)
--head N                  -H     Take first N rows/documents from input (mutually exclusive with --sample)
--seed N                         Random seed for reproducible outputs and document shuffling (default: 42)
--progress/--no-progress         Show progress bars (auto-detected: enabled for TTY, disabled with -vv)
--timeout N                      Timeout in seconds for individual LLM API calls (default: 90)
--skip-node NAME                 Skip specified node(s) during execution (can be used multiple times)
--stop-at NAME                   Stop execution before the specified node runs

Examples:

# Basic usage (creates zs.json and zs_pipeline.html)
uv run soak zs data/interview.txt

# Specify custom output name
uv run soak zs --output results data/interview.txt

# Multiple files
uv run soak zs --output analysis data/*.txt

# Set default model
uv run soak zs --output results --model gpt-4.1 data/*.txt

# Override specific model aliases
uv run soak zs --output results --model default=gpt-4.1-mini --model best=gpt-4.1 data/*.txt

# Override context variables
uv run soak zspe -o results data/*.txt \
  -c research_question="What are recovery experiences?" \
  -c excerpt_topics="Exercise and rehabilitation"

# Use custom template
uv run soak zs -o results -t my_template.html data/*.txt

# Use multiple templates (creates results_pipeline.html and results_simple.html)
uv run soak zs -o results -t pipeline.html -t simple.html data/*.txt

# Process ZIP archive
uv run soak zs -o results interviews.zip

# Process CSV spreadsheet (each row becomes a document)
uv run soak classifier_tabular -o results soak/data/test_data.csv

# Sample first 10 rows from spreadsheet
uv run soak classifier_tabular --head 10 -o results data/survey.xlsx

Pipeline Resolution:

soak looks for pipeline files in this order:

  1. ./PIPELINE (exact path)
  2. ./PIPELINE.soak
  3. ./PIPELINE.yml
  4. soak/pipelines/PIPELINE
  5. soak/pipelines/PIPELINE.soak
  6. soak/pipelines/PIPELINE.yml

Input Files:

  • Supports: .txt, .pdf, .docx, .csv, .xlsx, .zip
  • Glob patterns: data/*.txt, **/*.docx
  • CSV/XLSX: Each row becomes a separate document, with its columns accessible as template variables
  • ZIP files: Automatically extracted to temp directory
  • Multiple files processed in parallel
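The row-per-document behaviour for CSV/XLSX input can be pictured like this. The field names below (`name`, `text`, `columns`) are hypothetical; soak's actual document structure is not documented here:

```python
import csv
import io

def rows_to_documents(csv_text: str) -> list[dict]:
    """Illustrative sketch: turn each CSV row into its own 'document',
    keeping column values addressable by name alongside a plain-text
    rendering of the row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        docs.append({
            "name": f"row_{i}",
            "text": "\n".join(f"{k}: {v}" for k, v in row.items()),
            "columns": row,  # column values stay accessible by header name
        })
    return docs
```

This is why --head and --sample speak of "rows/documents": for tabular input the two are the same thing.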

Output:

Output files are always written to disk (never to stdout). All files go into a dump folder:

Without --output (pipeline zs.soak):

zs_dump/
├── zs.json              # Full pipeline data
├── zs_pipeline.html     # Rendered view (default template)
├── 01_Split_chunks/     # Per-node execution details
├── 02_Map_codes/
└── ...

With --output results:

results_dump/
├── results.json
├── results_pipeline.html
├── results_simple.html  # If -t simple specified
└── ...

Conflict handling:

  • Before running the pipeline, soak checks whether the output folder already exists
  • If the output folder exists and contains the JSON but --force is not given, soak generates only new templates (template-only mode)
  • If the output folder exists and --force is given, soak warns and overwrites everything
  • Template-only mode lets you add new templates without re-running the pipeline
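The conflict rules reduce to a small decision function. A sketch of that logic (a hypothetical helper for illustration, not soak's API):

```python
from pathlib import Path

def plan_run(dump_dir: Path, json_name: str, force: bool) -> str:
    """Return 'run' (fresh run), 'overwrite' (forced re-run), or
    'templates-only' (re-render templates against existing JSON)."""
    if not dump_dir.exists():
        return "run"
    if force:
        return "overwrite"       # --force: warn and replace everything
    if (dump_dir / json_name).exists():
        return "templates-only"  # JSON present: only generate new templates
    return "run"
```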

Model Aliases

Pipelines can define model aliases in default_config.models to assign different models to different roles. The --model flag overrides these aliases at runtime.

Syntax:

Form                 Effect
--model gpt-4.1      Sets the default alias
--model alias=model  Sets a specific alias
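Because --model can be repeated, the two forms fold into a single alias-to-model mapping. A sketch of that folding (illustrative, not soak's actual argument parser):

```python
def parse_model_flags(values: list[str]) -> dict[str, str]:
    """Fold repeated --model values into an alias -> model mapping.
    A bare model name sets the 'default' alias; 'alias=model' sets
    a specific alias. Later values win on conflict."""
    aliases: dict[str, str] = {}
    for value in values:
        if "=" in value:
            alias, model = value.split("=", 1)
            aliases[alias] = model
        else:
            aliases["default"] = value
    return aliases
```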

Example pipeline definition:

default_config:
  models:
    default: gpt-4.1-mini    # routine tasks
    best: gpt-4.1            # complex reasoning
  embeddings: text-embedding-3-large

nodes:
  - name: extract_codes
    type: Map
    model: default           # uses 'default' alias

  - name: synthesize
    type: Transform
    model: best              # uses 'best' alias

Override from CLI:

# Set default model only
uv run soak zs data/*.txt --model gpt-4.1

# Override specific aliases
uv run soak zs data/*.txt --model default=claude-3-5-sonnet --model best=claude-3-opus

# Multiple aliases
uv run soak zs data/*.txt \
  --model default=gpt-4.1-mini \
  --model best=gpt-4.1 \
  --model cheap=gpt-4.1-mini

See Model Aliases for full documentation.

compare

Compare analyses or string lists and generate comparison statistics.

Two modes:

  1. JSON mode: Compare QualitativeAnalysis JSON files from pipeline runs
  2. Strings mode: Compare columns of strings from XLSX/CSV files (useful for comparing theme lists)

# JSON mode
uv run soak compare results1.json results2.json -o comparison.html

# Strings mode
uv run soak compare --strings themes.xlsx --cols "A,B,C" -o comparison.html

Key Options:

Option               Short  Description
--output PATH        -o     Output file (.html for report, .txt for text stats)
--strings PATH       -s     XLSX/CSV file with columns to compare (enables strings mode)
--cols NAMES         -c     Comma-separated column names (default: A,B). All pairwise combinations computed.
--threshold FLOAT           Similarity threshold for binary matching (default: 0.6)
--similarity METRIC  -S     Similarity metric: angular (default), cosine, shepard
--ot-k FLOAT                Optimal transport mass penalty K (default: 0.25). Lower = more selective.
--embedding-model           Embedding model (default: text-embedding-3-large). Use local/model-name for local models.
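For intuition about the default angular metric: it can be derived from cosine similarity by mapping the angle between two embedding vectors onto [0, 1]. This is one common definition, shown here as a sketch; soak's exact formula may differ:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def angular_similarity(a: list[float], b: list[float]) -> float:
    # Identical direction -> 1.0, orthogonal -> 0.5, opposite -> 0.0.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return 1.0 - math.acos(cos) / math.pi
```

Under --threshold, a pair of items counts as a binary match when its similarity meets or exceeds the threshold (0.6 by default).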

Examples:

# Compare two pipeline results
uv run soak compare results1.json results2.json -o comparison.html

# Compare three sets of themes from spreadsheet columns
uv run soak compare --strings themes.xlsx --cols "Method1,Method2,Method3" -o comparison.html

# Use local embeddings (no API key needed)
uv run soak compare --embedding-model local/all-MiniLM-L6-v2 *.json

# Text output only (no HTML)
uv run soak compare --strings data.xlsx -o stats.txt

Output:

  • HTML report with similarity heatmaps, optimal transport visualisations, and matching statistics
  • Console output shows key metrics (suppressed when outputting to HTML)

show

Display contents of built-in pipelines or templates.

uv run soak show TYPE [NAME]

Arguments:

  • TYPE - pipeline or template
  • NAME - Name of item to show (optional - lists all if omitted)

Examples:

# List all pipelines
uv run soak show pipeline

# List all templates
uv run soak show template

# Show a specific pipeline
uv run soak show pipeline zs

# Show a specific template
uv run soak show template default

# Save to file for customisation
uv run soak show pipeline zs > my_analysis.soak
uv run soak show template default > my_template.html

Built-in Pipelines:

  • zs - Zero-shot thematic analysis
  • zspe - Pre-extraction thematic analysis
  • classifier - Multi-level classification with agreement analysis
  • classifier_tabular - Classification for spreadsheet data (CSV/Excel)
  • demo - Simple demonstration pipeline
  • verify - Quote verification workflow
  • test - Testing pipeline

Built-in Templates:

  • pipeline.html - Standard pipeline results view (default)
  • simple.html - Simplified results view
  • narrative.html - Narrative-focused output
  • comparison.html - Multi-run comparison view

Environment Variables

Variable          Description                     Default
LLM_API_KEY       API key for LLM provider        Required
LLM_API_BASE      Base URL for API                https://api.openai.com/v1
MAX_CONCURRENCY   Max parallel LLM calls          20
SOAK_MAX_RUNTIME  Max pipeline runtime (seconds)  1800 (30 minutes)

Set via:

export LLM_API_KEY=sk-...
export LLM_API_BASE=https://api.openai.com/v1

Or create .env file in working directory:

LLM_API_KEY=sk-...
LLM_API_BASE=https://api.openai.com/v1
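Conceptually, MAX_CONCURRENCY behaves like a semaphore around each LLM call: at most that many calls are in flight at once. A sketch of the idea, assuming asyncio-style concurrency (not soak's actual implementation):

```python
import asyncio
import os

# Read the cap from the environment, defaulting to 20 as documented above.
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "20"))
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def bounded_call(coro_fn, *args):
    """Run an async LLM call with at most MAX_CONCURRENCY in flight."""
    async with semaphore:
        return await coro_fn(*args)
```

Lowering the value (e.g. MAX_CONCURRENCY=5) is the usual remedy for provider rate-limit errors on large inputs.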

Exit Codes

Code  Meaning
0     Success
1     General error (missing files, invalid arguments)
2     Pipeline validation error

Tips

View progress:

uv run soak zs -o results -v data/*.txt

Process results with jq:

uv run soak zs data/test.txt  # Creates zs.json
cat zs.json | jq '.codes'

Iterate on templates:

# Get default template
uv run soak show template default > my_template.html

# Edit my_template.html

# Use it
uv run soak zs data/*.txt -o results -t my_template.html

Add new templates to existing results:

# Run once
uv run soak zs -o results data/*.txt

# Add another template without re-running (template-only mode)
uv run soak zs -o results -t my_template.html data/*.txt

Process large datasets:

# Reduce concurrency to avoid rate limits
export MAX_CONCURRENCY=5
uv run soak zs large_dataset/*.txt -o results
