# Understanding Calibration
Calibration transforms raw embedding similarity scores into interpretable values that correspond to semantic distance categories.
## Why Calibration Is Needed
Raw embedding similarities have two problems:
- **Model-specific ranges**: A score of 0.85 might mean “identical” for one model but “moderately similar” for another.
- **Compressed distributions**: Modern embeddings cluster semantically related texts tightly. Even unrelated themes often have similarities above 0.6.
Calibration addresses both by mapping raw scores to a standardised scale.
## The Calibration Scale
| Calibrated Score | Meaning |
|---|---|
| 0.9 | Same meaning (paraphrases) |
| 0.75 | Close meaning |
| 0.5 | Diverging (partial overlap) |
| 0.3 | Distant (weak relation) |
| 0.1 | Unrelated |
After calibration, a score of 0.75 means “close meaning” regardless of which embedding model you use.
## How It Works
### 1. Generate Training Data
We generate paraphrases of real themes at five semantic distances:
- Same: Identical meaning, different wording
- Close: Core meaning preserved, different framing
- Diverging: Partial overlap, different focus
- Distant: Same domain, different construct
- Unrelated: No meaningful overlap
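These five distance labels correspond directly to the calibrated targets in the scale above. As a minimal sketch (the dictionary name and label strings are illustrative, not part of the soak API):

```python
# Illustrative mapping from semantic-distance label to calibration target.
# The values come from the calibration scale table; the dict itself is a
# hypothetical convenience for the sketch, not the real implementation.
DISTANCE_TARGETS = {
    "same": 0.9,       # identical meaning, different wording
    "close": 0.75,     # core meaning preserved, different framing
    "diverging": 0.5,  # partial overlap, different focus
    "distant": 0.3,    # same domain, different construct
    "unrelated": 0.1,  # no meaningful overlap
}
```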
### 2. Compute Similarities
For each original-paraphrase pair, we compute angular similarity:
```
angular_sim = 1 - arccos(cosine_sim) / π
```
Angular similarity is a proper metric (satisfies triangle inequality) with range [0, 1].
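The formula above can be sketched in a few lines of Python (the function name is illustrative; the clipping guards against floating-point cosine values slightly outside [-1, 1]):

```python
import numpy as np

def angular_similarity(cosine_sim: float) -> float:
    """Map cosine similarity in [-1, 1] to angular similarity in [0, 1].

    angular_sim = 1 - arccos(cosine_sim) / pi
    """
    # Clip to guard against floating-point drift outside arccos's domain.
    cosine_sim = float(np.clip(cosine_sim, -1.0, 1.0))
    return 1.0 - float(np.arccos(cosine_sim)) / np.pi
```

For example, identical vectors (cosine 1.0) map to 1.0, orthogonal vectors (cosine 0.0) map to 0.5, and opposite vectors (cosine -1.0) map to 0.0.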
### 3. Fit Monotonic Spline
A shape-constrained additive model (SCAM) learns the mapping from raw similarity to target values. The monotonicity constraint ensures higher raw scores always map to higher calibrated scores.
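The real implementation fits a SCAM spline; as a rough Python sketch of the same idea, isotonic regression enforces an identical monotonicity constraint, though it yields a piecewise mapping rather than a smooth spline. The training values below are illustrative, not the real calibration data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical training pairs: raw angular similarities and their
# calibration targets (illustrative values only).
raw = np.array([0.55, 0.62, 0.70, 0.78, 0.88])
target = np.array([0.1, 0.3, 0.5, 0.75, 0.9])

# Isotonic regression enforces the same constraint as the SCAM spline:
# higher raw scores never map to lower calibrated scores.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, target)

# Apply the fitted mapping to new raw similarities.
calibrated = iso.predict(np.array([0.60, 0.80]))
```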
### 4. Apply to New Data
The fitted model transforms similarities in `soak compare`:

```bash
uv run soak compare results.json --embedding-model local/BAAI/bge-base-en-v1.5
# Bundled calibration automatically applied
```
## Model Recommendations
| Use Case | Model | Accuracy |
|---|---|---|
| Best accuracy (API) | text-embedding-3-small | 83.5% |
| Best local | BAAI/bge-base-en-v1.5 | 79.6% |
| Fastest | all-MiniLM-L6-v2 | 77.8% |
See `calibration-data/README.md` for the full comparison.
## Technical Details
### Target Value Selection
Targets (0.9, 0.75, 0.5, 0.3, 0.1) were empirically optimised to match natural embedding gaps, achieving uniform stretch ratios across the calibration curve.
### Degrees of Freedom
We use `df=7` for the spline, which produces a smooth S-curve that:
- Asymptotes at the edges rather than hitting bounds abruptly
- Preserves rank ordering for scores above 0.9
- Balances fit quality (MAE) with category discrimination
### Validation
Calibrations are validated using grouped holdout (20% of papers held out), ensuring generalisation to unseen themes.