LLM Evaluation Suite (soak eval)
The eval suite is a small, repeatable benchmark for the LLM contract that soaking depends on. It exists because weaker models can silently mangle hash references, drop required fields, or produce semantically poor codes/themes — symptoms that don’t show up as errors but degrade results.
Three things live under one roof:
- Probes: fixed templates (probe_consolidate.sd, probe_long_context.sd, etc.) run against any model, with deterministic checks (schema validity, hash fidelity).
- Snapshots: frozen single-stage cases captured from real runs, replayable against any model.
- Reports: markdown + HTML side-by-side comparison tables.
The same underlying machinery (soak.evals.run_probe) is callable from the soak CLI, the soakresearch Django management commands, and the pytest suite (pytest -m llm tests/evals).
Quick start
# Cheapest possible probe against one model
soak eval run --probe schema --model gpt-5-mini
# Sweep multiple probes × multiple models in parallel
soak eval run --probe schema,consolidate \
--model gpt-5-mini,gpt-4.1-mini,claude-haiku-4-5 \
--max-concurrent 5
# Build the comparison report (markdown + HTML)
python -m soak.evals.report --out ./soaking-eval/llm_evals.md
# or
python scripts/eval_report.py
Outputs land in ./soaking-eval/ by default:
- soaking-eval/results/<YYYY-MM-DD>.jsonl: append-only history, one line per (probe, model) call
- soaking-eval/llm_evals.md: markdown comparison report
- soaking-eval/llm_evals.html: pandoc-rendered HTML (generated automatically if pandoc is on PATH)
Override the location with the SOAK_EVAL_RESULTS=/path env var or the --results flag on the report command.
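Because the history is plain JSONL, it can be inspected without the report tool. The following is a minimal sketch, assuming each line is one JSON object; the helper name load_history is hypothetical, and it takes the results directory as an argument rather than guessing how SOAK_EVAL_RESULTS is resolved.

```python
import json
from pathlib import Path


def load_history(results_dir):
    """Return every row from the dated *.jsonl files, oldest file first."""
    rows = []
    # Date-stamped names (YYYY-MM-DD.jsonl) sort chronologically as strings.
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    rows.append(json.loads(line))
    return rows
```

Calling load_history("./soaking-eval/results") would return a list of dicts, one per recorded (probe, model) call.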
soak eval subcommands
| Subcommand | Purpose |
|---|---|
| soak eval run | Run probes against models, record metrics |
| soak eval replay | Replay a snapshot against a model |
| soak eval snapshot | Build a snapshot from a soak run JSON output |
| soak eval list | List committed snapshots |
soak eval run — running probes
Specifying probes and models
soak eval run --probe schema --model gpt-5-mini
Both flags accept CSV lists for sweeps:
soak eval run \
--probe schema,consolidate,long_context \
--model gpt-5-mini,gpt-4.1-mini,claude-haiku-4-5
This runs the cartesian product (3 probes × 3 models = 9 calls).
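The pairing can be pictured with itertools.product. This is only an illustration of the arithmetic, not the CLI's actual code:

```python
from itertools import product

probes = ["schema", "consolidate", "long_context"]
models = ["gpt-5-mini", "gpt-4.1-mini", "claude-haiku-4-5"]

# Every probe is paired with every model.
pairs = list(product(probes, models))
print(len(pairs))  # 9
```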
Available probes
| Probe | What it tests | Cost |
|---|---|---|
| schema | Single-call code extraction. Tool-calling smoke test. | Cheap (~$0.001) |
| consolidate | Codes → reference-mode consolidation → themes. Hash fidelity end-to-end. | Moderate (~$0.01) |
| long_context | Reference-mode consolidate from a 50-code fixture. Tests hash truncation under length. | Higher |
| themes_long | Theme generation from a 50-code fixture. Tests Theme.code_hashes fidelity at scale. | Higher |
Run soak eval run --help for the full current list (it’s drawn from soak.evals.AVAILABLE_PROBES).
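To make "hash fidelity" concrete: a model passes when every hash it references matches a hash it was actually given. A minimal sketch, assuming themes arrive as dicts with a code_hashes list; the function name and data shapes are hypothetical, and the real checks in soak.evals may differ.

```python
def check_hash_fidelity(known_hashes, themes):
    """Return the referenced hashes that do not match any known code hash."""
    known = set(known_hashes)
    unknown = []
    for theme in themes:
        for ref in theme.get("code_hashes", []):
            if ref not in known:  # truncated or invented reference
                unknown.append(ref)
    return unknown


known = ["a1b2c3", "d4e5f6"]
themes = [{"name": "Workload", "code_hashes": ["a1b2c3", "d4e5"]}]
print(check_hash_fidelity(known, themes))  # ['d4e5']
```

An empty list means every reference resolved; a truncated hash such as d4e5 is reported even though it is a prefix of a real one.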
Reproducibility
Two flags control determinism:
soak eval run -p schema -m gpt-5-mini --seed 1 --temperature 0
- --seed (default 1) is the model sampling seed. Pinning it makes re-runs reproducible: same prompt + same seed → same output → same joblib cache key → comparable cost/latency between runs.
- --temperature (default 0.0) requests deterministic decoding on models that honor it.
The seed and temperature are recorded in each JSONL row so the report can show what produced the result.
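The link between pinned parameters and a stable cache key can be illustrated with a hand-rolled hash. This sketch is an assumption made for illustration only; joblib derives its own keys from function arguments, and cache_key is not part of soak.

```python
import hashlib
import json


def cache_key(prompt, model, seed=1, temperature=0.0):
    """Hash the call parameters; identical inputs give an identical key."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "seed": seed, "temperature": temperature},
        sort_keys=True,  # stable field order so the digest is repeatable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


same = cache_key("Extract codes", "gpt-5-mini") == cache_key("Extract codes", "gpt-5-mini")
changed = cache_key("Extract codes", "gpt-5-mini", seed=2) == cache_key("Extract codes", "gpt-5-mini")
print(same, changed)  # True False
```

Changing the seed produces a different key, which is why an unpinned seed would make cost and latency numbers harder to compare between runs.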
Parallelism
soak eval run -p schema -m m1,m2,m3,m4,m5 --max-concurrent 5
--max-concurrent N (default 5) runs N (probe, model) pairs in parallel via a thread pool. struckdown’s internal MAX_CONCURRENCY semaphore continues to bound in-call concurrency separately.
Set --max-concurrent 1 for strictly serial execution if you’re debugging or hitting provider rate limits.
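The thread-pool behaviour can be sketched with the standard library. This is a minimal sketch, not soak's implementation; run_pairs and run_one are hypothetical names, and run_one stands in for whatever executes a single (probe, model) call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pairs(pairs, run_one, max_concurrent=5):
    """Run run_one(probe, model) for each pair, at most max_concurrent at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(run_one, probe, model) for probe, model in pairs]
        # Collect in submission order so results line up with pairs.
        return [future.result() for future in futures]


def fake_run(probe, model):
    # Stand-in for a real LLM call.
    return {"probe": probe, "model": model, "schema_valid": True}


results = run_pairs([("schema", "m1"), ("schema", "m2")], fake_run, max_concurrent=1)
print(len(results))  # 2
```

With max_concurrent=1 the pool has a single worker, so the pairs run strictly one after another.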
Credentials
soak eval run reads LLM_API_KEY and LLM_API_BASE from the environment, or accepts --api-key / --api-base. For the soakresearch web app there’s a sibling command, manage.py eval_probe, that resolves keys from the encrypted Credential table.
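One plausible resolution order is sketched below. It assumes an explicit flag value wins over the environment and that the API base is optional; neither is confirmed by the text above, and resolve_credentials is a hypothetical helper.

```python
import os


def resolve_credentials(api_key=None, api_base=None, env=None):
    """Pick credentials from explicit arguments first, then the environment."""
    env = os.environ if env is None else env
    key = api_key or env.get("LLM_API_KEY")
    base = api_base or env.get("LLM_API_BASE")  # assumed optional
    if not key:
        raise ValueError("no API key: pass --api-key or set LLM_API_KEY")
    return key, base
```

Passing env explicitly keeps the function easy to test without touching the real environment.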
Recording
By default each run appends to ./soaking-eval/results/<YYYY-MM-DD>.jsonl. Pass --no-record to skip recording (e.g. for ad-hoc one-offs you don’t want polluting history).
Exit codes
- 0: every (probe, model) pair passed schema_valid and any hash checks
- 1: at least one pair failed (the sweep continues, and every pair is still recorded)
- 2: bad arguments (unknown probe, missing API key, etc.)
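The 0/1 distinction reduces to an "all pairs passed" check over the recorded rows. A minimal sketch, assuming each row carries a schema_valid flag and an optional hash_ok flag; hash_ok and sweep_exit_code are hypothetical names. Exit code 2 is decided earlier, during argument validation, so it does not appear here.

```python
def sweep_exit_code(rows):
    """Return 0 if every row passed its checks, otherwise 1."""
    for row in rows:
        schema_ok = row.get("schema_valid") is True
        # Probes without hash checks are assumed to omit the field entirely.
        hash_ok = row.get("hash_ok", True) is True
        if not (schema_ok and hash_ok):
            return 1
    return 0


rows = [
    {"probe": "schema", "model": "m1", "schema_valid": True},
    {"probe": "consolidate", "model": "m1", "schema_valid": True, "hash_ok": False},
]
print(sweep_exit_code(rows))  # 1
```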
Reports
# Markdown + HTML (HTML auto-generated if pandoc is available)
python scripts/eval_report.py
# Or via the Django command if you want it from soakresearch:
manage.py eval_report
Each row in the report is the most recent entry per (probe, model) across all *.jsonl files in ./soaking-eval/results/. Re-running a probe replaces that pair's row in the report; earlier entries remain in the append-only JSONL history.
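The "most recent entry wins" rule can be sketched as a last-write-wins dictionary. This assumes rows are supplied in chronological order and carry probe and model fields; those field names, and latest_per_pair itself, are assumptions rather than the report's actual code.

```python
def latest_per_pair(rows):
    """Keep only the last row seen for each (probe, model) pair."""
    latest = {}
    for row in rows:  # rows must be in chronological order
        latest[(row["probe"], row["model"])] = row
    return list(latest.values())


history = [
    {"probe": "schema", "model": "m1", "cost": 0.002},
    {"probe": "schema", "model": "m2", "cost": 0.001},
    {"probe": "schema", "model": "m1", "cost": 0.001},  # re-run of the first pair
]
print(len(latest_per_pair(history)))  # 2
```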
The report includes:
- One table per probe with metrics (schema-valid, hash refs, coverage, cost, latency)
- A collapsible Examples section per probe with model-by-model side-by-side comparison: codes show name + description + supporting quotes; themes show name + description + the resolved code names for each code_hash reference.
Output format
- The default markdown uses HTML tables for the examples section so pandoc / GitHub render the wide side-by-side comparison cleanly.
- Run python scripts/eval_report.py --no-html to skip the pandoc step.
- The markdown uses real Unicode curly quotes (U+201C/201D) and box characters so the raw source is human-readable too.
Snapshots and replay
A snapshot is a frozen single-stage call: enough information to re-run one slot of one node against any chosen model, without standing up the full pipeline. Use snapshots to capture interesting cases from production runs and replay them against future model candidates.
Format
soaking-eval/snapshots/<name>/
template.sd # the prompt template (jinja, with [[type:slot]] markers)
inputs.json # context dict for complete()
expected.json # baseline outputs keyed by slot name
metadata.json # source ref, model, date, scrub status, schema version
README.md # why this case is interesting
Building a snapshot
From a soak run output JSON:
soak eval snapshot \
--from-cli /path/to/run_output.json \
--template /path/to/template.sd \
--as my-case --stage themes
From a soakresearch AnalysisRun (uses Django credentials, scrubs emails/phones by default):
manage.py snapshot_run <run_uuid> --as my-case --stage theme_groups
manage.py snapshot_run <run_uuid> --as my-case --stage theme_groups --keep-pii
Replaying
soak eval replay ./soaking-eval/snapshots/my-case --model gpt-5-mini
soak eval replay ./soaking-eval/snapshots/my-case --model gpt-4.1
Replay loads the snapshot, runs complete() against the named model, and prints the actual outputs alongside a summary diff against expected.json. No full pipeline is involved — this is intentionally cheap.
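The summary diff can be thought of as a per-slot comparison against the baseline. A minimal sketch, assuming expected.json is a flat object keyed by slot name as described above; load_expected, summarise_diff and the four status labels are hypothetical, not the replay command's real output.

```python
import json
from pathlib import Path


def load_expected(snapshot_dir):
    """Read the baseline outputs stored alongside the snapshot."""
    path = Path(snapshot_dir) / "expected.json"
    return json.loads(path.read_text(encoding="utf-8"))


def summarise_diff(expected, actual):
    """Classify each slot as match, changed, missing, or unexpected."""
    summary = {}
    for slot in sorted(set(expected) | set(actual)):
        if slot not in actual:
            summary[slot] = "missing"
        elif slot not in expected:
            summary[slot] = "unexpected"
        elif expected[slot] == actual[slot]:
            summary[slot] = "match"
        else:
            summary[slot] = "changed"
    return summary


expected = {"themes": ["A", "B"], "summary": "x"}
actual = {"themes": ["A", "B"], "summary": "y", "extra": 1}
print(summarise_diff(expected, actual))
# {'extra': 'unexpected', 'summary': 'changed', 'themes': 'match'}
```

Exact equality is a deliberately strict comparison; free-text slots will often differ between models, so a real diff would likely need a looser measure for those.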
Listing snapshots
soak eval list
File layout summary
./
├── soaking-eval/ # auto-created in CWD
│ ├── results/
│ │ └── 2026-05-06.jsonl # one line per probe×model
│ ├── snapshots/
│ │ └── <name>/ # 5 files per snapshot
│ ├── llm_evals.md # report markdown
│ └── llm_evals.html # pandoc HTML
└── ...
To redirect: SOAK_EVAL_RESULTS=/path/to/dir env var, or pass --results /path to the report tool.
Pytest gate (pytest -m llm)
pytest -m llm tests/evals # gate: hard-asserts pass
pytest -m llm tests/evals --eval-mode # sweep: record only, no asserts
SOAK_EVAL_MODELS=gpt-5-mini,gpt-4.1-mini \
pytest -m llm tests/evals --eval-mode # sweep specific models
Two flavours of test:
- test_*_gate runs against the pinned model in SOAK_GATE_MODEL (default gpt-5.1-mini) with hard assertions. Fails CI on prompt drift, schema bugs, struckdown changes.
- test_probe_sweep parametrises over every probe × every model in SOAK_EVAL_MODELS. Skipped unless --eval-mode (or SOAK_EVAL_MODE=1); records JSONL without asserting.
Adding a new probe
- Create soak/evals/data/probe_<name>.sd (a struckdown template).
- In soak/evals/probes.py, add a _run_<name>_checks() function and register it in AVAILABLE_PROBES alongside the template name.
- If the probe needs pre-loaded fixture data, add a context_loader callable that returns a dict of jinja vars.
- (Optional) add a per-probe formatter in soak/evals/report.py so the markdown table shows the right columns.