Theme Similarity Metrics
Introduction
When comparing two sets of qualitative themes (e.g., human-generated vs LLM-generated, or multiple LLM runs), we need metrics that capture how well the theme sets align. This document explores three approaches to measuring theme similarity:
- Paper’s approach (Raza et al. 2025): Jaccard Similarity with n×m denominator
- Current implementation: Precision/Recall/F1 with bidirectional max-similarity
- Hungarian matching: Optimal one-to-one assignment
Each approach makes different assumptions about what constitutes “good alignment” and produces different scores for the same data.
The Paper’s Approach (Raza et al. 2025)
Reference: Raza et al. 2025 - LLM-TA Pipeline
Mathematical Definitions
Let H = {h₁, h₂, …, hₙ} represent human-generated themes, and L = {l₁, l₂, …, lₘ} represent LLM-generated themes.
For each pair (hᵢ, lⱼ) in H × L, compute a similarity score s(hᵢ, lⱼ) (typically cosine similarity of embeddings).
Define Sθ = {(hᵢ, lⱼ) ∈ H × L : s(hᵢ, lⱼ) ≥ θ}, where θ is the similarity threshold.
Jaccard Similarity:
Jaccard = |Sθ| / (n × m)
Where:
- |Sθ| = number of theme pairs with similarity ≥ threshold
- n × m = total possible pairs
Hit Rate:
Hit Rate = |Hₛ| / n
Where Hₛ = {h ∈ H : ∃ l ∈ L, s(h, l) ≥ θ} (themes in H that have at least one match in L).
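These two definitions can be sketched in a few lines of NumPy (the function name and the 2×3 example matrix are illustrative, not from the paper's code):

```python
import numpy as np

def paper_jaccard_and_hit_rate(sim: np.ndarray, theta: float = 0.7):
    """Paper-style Jaccard (|S_theta| / (n*m)) and Hit Rate (|H_s| / n)."""
    matches = sim >= theta                 # binary n x m match matrix
    jaccard = matches.sum() / matches.size # pairs above threshold / all pairs
    hit_rate = matches.any(axis=1).mean()  # fraction of H rows with >= 1 match
    return jaccard, hit_rate

# Hypothetical 2x3 similarity matrix: theme h1 matches two L themes, h2 matches none
sim = np.array([[0.9, 0.2, 0.8],
                [0.1, 0.3, 0.2]])
jaccard, hit_rate = paper_jaccard_and_hit_rate(sim)
# jaccard = 2/6 ≈ 0.33, hit_rate = 1/2 = 0.5
```

Note that h1's two matches both count toward the Jaccard numerator, illustrating the many-to-many character of the formulation.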
What This Measures
Jaccard Similarity in this formulation measures the density of semantic overlap across the entire theme space. It answers: “What fraction of all possible thematic relationships are similar?”
Hit Rate measures coverage: what proportion of one set finds representation in the other.
Key Characteristics
- Many-to-many matching: One theme can match multiple themes in the other set
- Density-based: Numerator counts all pairs above threshold, not unique themes
- Size-dependent: Larger theme sets produce lower Jaccard scores even with perfect one-to-one alignment
Evaluation Methods
Raza et al. use four parallel evaluation methods:
- all-MiniLM-L6-v2 (embedding) → cosine similarity
- all-mpnet-base-v2 (embedding) → cosine similarity
- sentence-T5-xxl (embedding) → cosine similarity
- LLM judge (GPT-based) → similarity score 0-1
For each method, they:
- Generate pairwise similarity scores for all H × L theme pairs
- Convert to binary using threshold θ
- Calculate Jaccard and Hit Rate separately per method
- Never aggregate or average across methods
The LLM judge assigns similarity scores based on conceptual overlap, with penalties for specificity mismatches (one theme very specific, other very general).
Critical Note: Mathematical Properties of Cosine Similarity
Before discussing implementation choices, it’s essential to understand fundamental limitations of cosine similarity as a metric.
Cosine Similarity is Not a True Metric
Violates triangle inequality (Schubert 2021): Cosine distance (1 − cosine similarity) does not satisfy the triangle inequality property fundamental to distance metrics. Concrete counterexample: d₃ = 0.3562 > d₁ + d₂ = 0.2286, violating d(x,z) ≤ d(x,y) + d(y,z).
Angular distance is the proper metric: The angle θ = arccos(cosine) forms a true metric on the unit sphere, but cosine values themselves do not.
Can Yield Arbitrary Results
Embedding instability (Rendle et al. 2024): Research shows “cosine similarity of learned embeddings can in fact yield arbitrary results.” Different L2 regularizations produce identical prediction models but vastly different cosine similarities – rendering comparisons potentially meaningless.
Loss of semantic information (arXiv:2509.19323): Cosine similarity discards magnitude information entirely, yet vector magnitude can encode meaningful semantic properties (specificity, importance). Additionally, representation collapse in modern embeddings means “even semantically disparate sentences can exhibit high cosine similarity.”
Nonlinear Relationship to Semantic Distance
Cosine values have a nonlinear relationship to actual semantic distance:
- Small changes near 1.0 (cos 0° → cos 10°: 1.0 → 0.985) represent tiny semantic shifts
- Same magnitude changes near 0.5 (cos 60° → cos 70°: 0.5 → 0.342) represent much larger shifts
- Converting similarities back to angles via θ = arccos(s) makes the nonlinearity explicit
This means: 0.9 → 0.95 is semantically very different from 0.3 → 0.35, despite both being 0.05 increases.
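This nonlinearity is easy to verify directly: equal steps in cosine correspond to unequal angular steps (a minimal check using only the standard library):

```python
import math

def angle_deg(cos_sim: float) -> float:
    """Angle in degrees corresponding to a cosine similarity."""
    return math.degrees(math.acos(cos_sim))

# The same 0.1 drop in cosine at different points on the scale:
high_gap = angle_deg(0.8) - angle_deg(0.9)  # near 1.0: ≈ 11.0°
low_gap = angle_deg(0.2) - angle_deg(0.3)   # near 0.0: ≈ 5.9°
# Equal cosine differences are NOT equal angular (geometric) differences
```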
Why Averaging is Problematic
Not a proper measurement scale:
Cosine similarity is a bounded similarity coefficient, not a ratio or interval scale:
- Not ratio scale: No true zero (0 = orthogonal, not “no similarity”); 0.8 ≠ “twice as similar” as 0.4
- Not interval scale: Differences are nonlinear and not homogeneous across the scale
- 0.9→0.8 = 11.0° angular change
- 0.3→0.2 = 5.9° angular change
- Same numerical difference (0.1) ≠ same semantic/geometric distance
- Bounded affine-ish: Monotonic with angular distance but no equal-interval guarantee
Implication: Averaging cosine similarities lacks mathematical foundation. You’re averaging nonlinear transformations of angles.
Practical consequences:
mean([0.95, 0.95, 0.30]) = 0.73 # Suggests "moderate similarity"
# But reality: 2/3 themes have excellent matches, 1/3 has poor match
# The mean masks the bimodal distribution AND loses geometric meaning
Rigorous alternative: Average angular distances θ = arccos(cosine), which IS a proper metric.
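The difference is easy to demonstrate on the example above; this sketch contrasts the mean cosine with the mean angular distance and the median (variable names are illustrative):

```python
import math
import statistics

sims = [0.95, 0.95, 0.30]

mean_cos = statistics.mean(sims)        # ≈ 0.73: suggests "moderate similarity"
median_cos = statistics.median(sims)    # 0.95: reveals the skewed distribution

# Rigorous alternative: average angular distances, which form a proper metric
angles = [math.degrees(math.acos(s)) for s in sims]  # ≈ [18.2°, 18.2°, 72.5°]
mean_angle = statistics.mean(angles)                 # ≈ 36.3°
```

Reporting the median alongside quartiles, or the mean angle, avoids masking the bimodal distribution that the raw mean hides.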
What Raza et al. do instead:
- Threshold to binary (≥ θ → 1, else 0)
- Count proportions (which ARE ratio scale)
- Report methods separately (never average across embedding models or judge scores)
This approach converts similarity to a proper ratio scale (proportions/percentages) before aggregation.
Current Implementation
Our implementation in soak/comparators/similarity_comparator.py calculates:
Metrics Computed
- Match Matrix: binary matrix where `match_matrix[i,j] = 1` if `similarity[i,j] ≥ threshold`
- Recall (equivalent to the paper's Hit Rate): `recall = (number of A themes with ≥1 match in B) / |A|`
- Precision (symmetric Hit Rate): `precision = (number of B themes with ≥1 match in A) / |B|`
- F1 Score: `f1 = 2 * (precision * recall) / (precision + recall)`
- Jaccard (paper's formulation): `jaccard = match_matrix.sum() / (|A| × |B|)`
- Bidirectional Max-Similarity:
  - `a_b_most_similar = mean(max similarity for each A theme across all B themes)`
  - `b_a_most_similar = mean(max similarity for each B theme across all A themes)`
  - `similarity_f1 = harmonic mean of the above two`

⚠️ Note: The bidirectional metric averages raw cosine similarity scores, which is mathematically questionable (see "Why Averaging is Problematic" above). While pragmatically common, this violates measurement theory since cosine similarity is not a ratio scale. Consider using threshold-based metrics (precision/recall/F1) or reporting full distributions (min, Q1, median, Q3, max) instead.
Implementation Details
# From compare_result_similarity() function
sim_matrix = cosine_similarity(emb_A, emb_B) # |A| × |B| matrix
match_matrix = sim_matrix >= threshold # Binary matrix
# Recall: % of A themes with any match
recall = match_matrix.any(axis=1).sum() / len(emb_A)
# Precision: % of B themes with any match
precision = match_matrix.any(axis=0).sum() / len(emb_B)
# Jaccard: matches / all possible pairs
jaccard = match_matrix.sum() / match_matrix.size
Hungarian Matching Alternative
Concept
The Hungarian algorithm (also called Kuhn-Munkres) finds the optimal one-to-one assignment between two sets that maximizes total similarity.
Unlike the many-to-many approaches above, this enforces that each theme maps to at most one theme in the other set.
Algorithm
from scipy.optimize import linear_sum_assignment
import numpy as np
# Convert similarity to cost (algorithm minimizes)
cost_matrix = 1 - similarity_matrix
# Find optimal assignment
row_indices, col_indices = linear_sum_assignment(cost_matrix)
# Extract matched pairs above threshold
matched_pairs = [
(i, j, similarity_matrix[i, j])
for i, j in zip(row_indices, col_indices)
if similarity_matrix[i, j] >= threshold
]
Metrics from Hungarian Matching
- Mean Matched Similarity: Average similarity of optimal pairs
- Coverage A: Proportion of A themes in matched pairs above threshold
- Coverage B: Proportion of B themes in matched pairs above threshold
- Hungarian F1: Harmonic mean of Coverage A and Coverage B
- True Jaccard (optional): Using matched themes:
intersection = number of matched pairs above threshold
union = |A| + |B| - intersection
jaccard = intersection / union
What This Measures
Hungarian matching measures optimal alignment quality under one-to-one constraint. It answers: “If each theme can match only one counterpart, what’s the best possible alignment?”
Concrete Examples
Example 1: Perfect 5×5 Alignment
Setup:
- Set A: 5 themes [A1, A2, A3, A4, A5]
- Set B: 5 themes [B1, B2, B3, B4, B5]
- Similarity matrix: Perfect diagonal (each Aᵢ matches only Bᵢ with similarity 0.85)
- Threshold: 0.7
Similarity Matrix:
B1 B2 B3 B4 B5
A1 0.85 0.30 0.25 0.20 0.15
A2 0.30 0.85 0.30 0.25 0.20
A3 0.25 0.30 0.85 0.30 0.25
A4 0.20 0.25 0.30 0.85 0.30
A5 0.15 0.20 0.25 0.30 0.85
Results:
| Metric | Paper’s Jaccard | Current Recall/Precision | Hungarian |
|---|---|---|---|
| Primary score | 5/25 = 0.20 | F1 = 1.0 | Coverage F1 = 1.0 |
| Interpretation | Only 20% of pairs similar | Perfect coverage both ways | Perfect optimal alignment |
| Coverage A | N/A | 5/5 = 1.0 | 5/5 = 1.0 |
| Coverage B | N/A | 5/5 = 1.0 | 5/5 = 1.0 |
| Mean similarity | N/A | 0.85 (bidirectional) | 0.85 (optimal pairs) |
| “True” Jaccard | 0.20 | N/A | 5/(5+5-5) = 1.0 |
Analysis:
- Paper’s Jaccard gives 0.20 for perfect alignment because only 5 out of 25 possible pairs are similar
- Current metrics (F1 = 1.0) correctly identify perfect coverage
- Hungarian gives perfect scores and can compute proper Jaccard = 1.0
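These three scores can be reproduced directly from the matrix above. A quick sketch (brute-force enumeration of assignments stands in for the Hungarian algorithm here, which returns the same optimum for a 5×5 matrix):

```python
from itertools import permutations
import numpy as np

# Example 1 matrix: 0.85 on the diagonal, banded sub-threshold values elsewhere
sim = np.array([[0.85 if i == j else 0.35 - 0.05 * abs(i - j)
                 for j in range(5)] for i in range(5)])
threshold = 0.7
match = sim >= threshold

paper_jaccard = match.sum() / match.size            # 5/25 = 0.20
recall = match.any(axis=1).mean()                   # 1.0
precision = match.any(axis=0).mean()                # 1.0
f1 = 2 * precision * recall / (precision + recall)  # 1.0

# Optimal one-to-one assignment (exhaustive search over all 5! assignments)
best = max(permutations(range(5)),
           key=lambda p: sum(sim[i, p[i]] for i in range(5)))
n_matched = sum(sim[i, best[i]] >= threshold for i in range(5))  # 5 diagonal pairs
true_jaccard = n_matched / (5 + 5 - n_matched)                   # 1.0
```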
Example 2: Perfect 10×10 Alignment
Setup:
- Same as Example 1 but with 10 themes in each set
- Perfect diagonal alignment
Results:
| Metric | Paper’s Jaccard | Current F1 | Hungarian |
|---|---|---|---|
| Primary score | 10/100 = 0.10 | 1.0 | Coverage F1 = 1.0 |
| “True” Jaccard | 0.10 | N/A | 10/(10+10-10) = 1.0 |
Analysis:
- Paper’s Jaccard gets worse (0.10 vs 0.20) despite identical quality
- Shows size-dependency problem: larger sets → lower scores even with perfect alignment
- Current F1 and Hungarian both correctly give 1.0
Key Insight: Paper’s Jaccard penalizes having more themes, which is problematic if theme count varies.
Example 3: Asymmetric Sets (5A × 10B)
Setup:
- Set A: 5 themes
- Set B: 10 themes
- All 5 A themes match something in B
- Only 5 of 10 B themes got matched
- Threshold: 0.7
Results:
| Metric | Paper’s Jaccard | Current Metrics | Hungarian |
|---|---|---|---|
| Primary score | 5/50 = 0.10 | Recall=1.0, Precision=0.5, F1=0.67 | Coverage F1=0.67 |
| Coverage A | N/A | 5/5 = 1.0 | 5/5 = 1.0 |
| Coverage B | N/A | 5/10 = 0.5 | 5/10 = 0.5 |
| “True” Jaccard | 0.10 | N/A | 5/(5+10-5) = 0.50 |
Analysis:
- Paper’s Jaccard (0.10) severely underestimates quality
- Current F1 (0.67) balances full A coverage with partial B coverage
- Hungarian F1 (0.67) gives same result with one-to-one constraint
- “True” Jaccard (0.50) reflects that half the themes in the union were matched
Interpretation: Set A is fully represented in B, but B has redundant/extra themes. This is captured better by F1 than paper’s Jaccard.
Example 4: Many-to-Many Overlapping Themes
Setup:
- Set A: 3 themes [A1, A2, A3]
- Set B: 3 themes [B1, B2, B3]
- A1 and A2 both match B1 and B2 (overlapping concepts)
- A3 matches only B3
- Threshold: 0.7
Similarity Matrix:
B1 B2 B3
A1 0.80 0.75 0.40
A2 0.75 0.80 0.30
A3 0.20 0.30 0.85
Match Matrix (≥0.7):
B1 B2 B3
A1 1 1 0
A2 1 1 0
A3 0 0 1
Results:
| Metric | Paper’s Jaccard | Current Metrics | Hungarian |
|---|---|---|---|
| Primary score | 5/9 = 0.56 | F1 = 1.0 | Coverage F1 = 1.0 |
| Matched pairs | 5 pairs | All themes covered | 3 optimal pairs |
| Mean similarity | N/A | 0.817 (bidirectional) | 0.817 (optimal) |
| Optimal assignment | N/A | N/A | A1→B1 (0.80), A2→B2 (0.80), A3→B3 (0.85) |
Analysis:
- Paper’s Jaccard (0.56) reflects many-to-many matching (5 pairs out of 9 possible)
- Current F1 (1.0) shows perfect coverage – every theme finds a match
- Hungarian finds the globally optimal one-to-one alignment (mean similarity 0.817 across the three optimal pairs, matching the bidirectional mean here)
Key Insight: When themes overlap conceptually (A1 and A2 both match B1, B2), different approaches give different answers:
- Paper’s Jaccard: counts all 5 similar pairs
- Current F1: just checks coverage (all themes covered = 1.0)
- Hungarian: picks best one-to-one mapping (A1→B1, A2→B2 rather than A1→B2, A2→B1)
Example 5: Partial Mismatch
Setup:
- Set A: 5 themes
- Set B: 5 themes
- Only 3 themes align (diagonal positions 1,2,3)
- Themes 4 and 5 don’t match anything
- Threshold: 0.7
Match Matrix:
B1 B2 B3 B4 B5
A1 1 0 0 0 0
A2 0 1 0 0 0
A3 0 0 1 0 0
A4 0 0 0 0 0
A5 0 0 0 0 0
Results:
| Metric | Paper’s Jaccard | Current Metrics | Hungarian |
|---|---|---|---|
| Primary score | 3/25 = 0.12 | Recall=0.6, Precision=0.6, F1=0.60 | Coverage F1=0.60 |
| Coverage A | N/A | 3/5 = 0.6 | 3/5 = 0.6 |
| Coverage B | N/A | 3/5 = 0.6 | 3/5 = 0.6 |
| “True” Jaccard | 0.12 | N/A | 3/(5+5-3) = 0.43 |
Analysis:
- Paper’s Jaccard (0.12) very low despite 60% coverage
- Current F1 (0.60) accurately reflects 60% coverage on both sides
- Hungarian F1 (0.60) gives same result
- “True” Jaccard (0.43) is higher than paper’s because denominator is 7 not 25
Key Insight: All approaches agree on relative quality (partial match), but paper’s Jaccard produces much lower absolute scores.
Summary Comparison Table
| Scenario | Paper Jaccard | Current F1 | Hungarian F1 | “True” Jaccard |
|---|---|---|---|---|
| Perfect 5×5 | 0.20 | 1.0 | 1.0 | 1.0 |
| Perfect 10×10 | 0.10 | 1.0 | 1.0 | 1.0 |
| Asymmetric 5×10 | 0.10 | 0.67 | 0.67 | 0.50 |
| Many-to-many 3×3 | 0.56 | 1.0 | 1.0 | 1.0 |
| Partial 5×5 | 0.12 | 0.60 | 0.60 | 0.43 |
Key Observations:
- Paper’s Jaccard is systematically low and cannot reach 1.0 for one-to-one alignment
- Size dependency: Perfect 10×10 scores lower than perfect 5×5
- Current F1 and Hungarian F1 are similar but Hungarian uses optimal one-to-one matching
- "True" Jaccard (with union = |A| + |B| − intersection) gives interpretable results
When to Use Each Approach
Use Paper’s Jaccard When:
- Replicating published methodology for comparability
- Expecting dense many-to-many theme relationships
- Want to measure “semantic overlap density” across theme space
- Theme redundancy/overlap is important to capture
Limitations:
- Cannot reach 1.0 for perfect one-to-one alignment
- Penalizes larger theme sets
- Non-intuitive scale (hard to interpret absolute values)
Use Current Precision/Recall/F1 When:
- Want to measure coverage: what proportion of themes find matches
- Don’t care about one-to-one constraint (themes can match multiple counterparts)
- Need interpretable metrics (1.0 = perfect coverage)
- Want bidirectional coverage assessment
Strengths:
- Intuitive interpretation
- Handles asymmetric set sizes well
- F1 = 1.0 means all themes (both sets) found matches
Limitations:
- Doesn’t distinguish between one-to-one and many-to-many matching
- Doesn’t consider match quality beyond threshold
- Can give 1.0 even with weak matches (just above threshold)
Use Hungarian Matching When:
- Conceptual model is “each theme should match one counterpart”
- Want globally optimal one-to-one alignment
- Need match quality metrics (mean similarity of best pairs)
- Comparing analyses where you expect similar theme counts
- Want both coverage and quality assessment
Strengths:
- Enforces one-to-one constraint (no double-counting)
- Finds globally optimal assignment
- Provides interpretable quality metrics
- Can compute “true” Jaccard with proper denominator
Limitations:
- Forces one-to-one even when many-to-many might be appropriate
- More complex to implement and explain
- Assumes sets should be roughly same size
Recommendations
For Your Use Case
Given your assumption that “perfect alignment = each theme in A has exactly one match in B”:
Primary recommendation: Add Hungarian matching alongside current metrics
Implement Hungarian algorithm to provide:
- Coverage metrics: Proportion of themes matched above threshold (primary metric)
- Hungarian F1: Harmonic mean of Coverage A and Coverage B
- "True" Jaccard: `matched / (|A| + |B| - matched)` for proper Jaccard interpretation
- Distribution of match quality: report min, Q1, median, Q3, max of matched pair similarities (instead of the mean)
⚠️ Avoid averaging raw cosine similarities due to non-ratio-scale properties. If continuous scores are needed, report full distribution or use percentile-based metrics (e.g., “% of themes with similarity > 0.8”).
Keep existing metrics:
- Current F1: Still useful for quick coverage assessment (threshold-based, no averaging)
- Bidirectional max-similarity: Provides quality estimate but note averaging limitations
- Paper’s Jaccard: Keep for methodological comparison/replication
Consider adding:
- Per-method reporting: Like Raza et al., report results separately for each embedding model
- Percentile metrics: “% themes above 0.7”, “% themes above 0.8”, etc.
- Angular distance: Use arccos(cosine) for a proper metric (if mathematically needed)
Implementation Strategy
from typing import Any, Dict

import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_matching(
    similarity_matrix: np.ndarray,
    threshold: float = 0.7,
) -> Dict[str, Any]:
    """Compute optimal one-to-one theme matching."""
    n_A, n_B = similarity_matrix.shape

    # Pad to a square matrix if needed (padding cells have similarity 0).
    # Note: recent SciPy versions accept rectangular cost matrices directly;
    # padding is kept here for explicitness.
    size = max(n_A, n_B)
    sim_padded = np.zeros((size, size))
    sim_padded[:n_A, :n_B] = similarity_matrix

    # Find optimal assignment (minimize cost = maximize similarity)
    cost = 1 - sim_padded
    row_ind, col_ind = linear_sum_assignment(cost)

    # Extract real pairs (exclude padding rows/columns)
    pairs = [
        (i, j, similarity_matrix[i, j])
        for i, j in zip(row_ind, col_ind)
        if i < n_A and j < n_B
    ]

    # Filter by threshold
    matched_pairs = [p for p in pairs if p[2] >= threshold]

    # Coverage metrics
    coverage_A = len(matched_pairs) / n_A if n_A > 0 else 0.0
    coverage_B = len(matched_pairs) / n_B if n_B > 0 else 0.0
    hungarian_f1 = (
        2 * coverage_A * coverage_B / (coverage_A + coverage_B)
        if (coverage_A + coverage_B) > 0 else 0.0
    )

    # "True" Jaccard
    intersection = len(matched_pairs)
    union = n_A + n_B - intersection
    true_jaccard = intersection / union if union > 0 else 0.0

    # Distribution of match quality (instead of a single mean)
    similarities = [p[2] for p in matched_pairs]
    if similarities:
        similarity_dist = {
            "min": float(np.min(similarities)),
            "q1": float(np.percentile(similarities, 25)),
            "median": float(np.median(similarities)),
            "q3": float(np.percentile(similarities, 75)),
            "max": float(np.max(similarities)),
        }
    else:
        similarity_dist = {"min": 0, "q1": 0, "median": 0, "q3": 0, "max": 0}

    return {
        "matched_pairs": matched_pairs,
        "similarity_distribution": similarity_dist,
        "coverage_A": coverage_A,
        "coverage_B": coverage_B,
        "hungarian_f1": hungarian_f1,
        "true_jaccard": true_jaccard,
        "n_matched": len(matched_pairs),
        "n_total_pairs": len(pairs),
    }
Reporting Strategy
When comparing theme sets, report:
- Hungarian F1 as primary metric (one-to-one alignment quality)
- Current F1 for coverage (allows many-to-many)
- Similarity distribution from Hungarian (median and quartiles, not mean)
- Coverage metrics (A and B separately) to detect asymmetry
- Paper’s Jaccard in parentheses for methodological comparison
Example output:
Theme Alignment: Hungarian F1 = 0.85, Coverage F1 = 0.92
- Optimal one-to-one: 17/20 themes matched
- Match quality: median=0.78 (Q1=0.71, Q3=0.84, range: 0.65-0.91)
- Coverage: 17/20 A themes (85%), 17/19 B themes (89%)
- Paper's Jaccard: 0.23 (for comparison)
This gives users:
- One-to-one quality (Hungarian) matching your conceptual model
- Many-to-many coverage (current F1) for flexibility
- Distribution not mean (avoids ratio-scale violation)
- Methodological comparability (paper’s Jaccard)
Handling the Current “Jaccard” Metric
Options (in order of recommendation):
- Rename to `match_density`: keep the calculation, change the name to reflect what it actually measures
- Add `hungarian_jaccard` or `true_jaccard`: compute a proper Jaccard from Hungarian matching
- Keep both: report the paper's formula as `jaccard_raza2025` and the new one as `jaccard`
This maintains backward compatibility while adding clarity.
Ideas for Future Exploration
This section presents conceptual approaches not yet implemented, intended to facilitate expert discussion about alternative metrics for theme similarity.
The Gap-Based Distinctiveness Problem
Core question: Does a theme have a clear unique match, or is it ambiguous between multiple candidates?
Current approach: Use absolute threshold (e.g., similarity ≥ 0.7) to define matches. But this doesn’t distinguish between:
- Case A: Best match = 0.85, second-best = 0.45 (clear winner)
- Case B: Best match = 0.85, second-best = 0.82 (ambiguous)
Both pass the threshold, but Case A suggests a unique match while Case B suggests multiple plausible candidates.
Proposed solution: Measure the gap between best and second-best match.
Pragmatic Justification
Why this approach is useful (despite limited mathematical rigor):
Empirical track record:
- Gap-based methods widely successful in ML and computer vision
- Lowe’s ratio test, silhouette coefficient, confidence margins all rely on similar logic
- Empirically correlates with match robustness
Practical advantages over averaging:
- Preserves information about distinctiveness (unique vs ambiguous)
- More interpretable: “20% better than runner-up” vs “average similarity 0.73”
- Common engineering heuristic with proven utility
Mathematical limitations:
- Cosine similarity is not an interval scale – differences are nonlinear (as shown earlier)
- `sim_best - sim_second` is a heuristic, not a mathematically grounded metric difference
- For rigorous treatment, use angular distance gaps `θ_best - θ_second` (see below)

Trade-off: Raw cosine gaps are simpler and more familiar; angular gaps are more principled. For most applications, cosine gaps are adequate.
Parallels in ML and Computer Vision
Lowe’s Ratio Test (SIFT feature matching, 2004):
ratio = distance_best / distance_second
if ratio < 0.8:  # second-best is at least 25% farther than the best
accept_match()
Used extensively in computer vision to reject ambiguous correspondences. A distinctive feature match should have a large gap to the second-best candidate.
Silhouette Coefficient (clustering quality):
a = cohesion (within-cluster distance)
b = separation (to nearest different cluster)
silhouette = (b - a) / max(a, b)
Measures how well-separated clusters are. Analogously, for themes:
distinctiveness = (sim_best - sim_second) / max(sim_best, sim_second)
Confidence Margin (classification, active learning):
margin = prob_top_class - prob_second_class
Small margin indicates uncertain prediction. In active learning, these are prioritized for labeling. For theme matching, small margins suggest ambiguous matches requiring expert review.
Proposed Gap-Based Metrics
- Absolute gap: `gap = sim_best - sim_second`
  - Simple, intuitive
  - Issue: the same gap means different things at different similarity levels
- Relative gap: `gap_rel = (sim_best - sim_second) / sim_best`
  - Percentage improvement over the second-best
  - "Best match is 20% better than runner-up"
- Normalized gap: `gap_norm = (sim_best - sim_second) / max(sim_best, sim_second)`
  - Silhouette-style, range [0, 1]
  - 1.0 = maximally distinctive
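The three variants are straightforward to compute side by side; a small illustrative helper (the function name and example values are hypothetical):

```python
def gap_metrics(sims: list[float]) -> dict[str, float]:
    """Gap-based distinctiveness scores for one theme's similarities
    to all candidate themes in the other set."""
    top = sorted(sims, reverse=True)
    best, second = top[0], top[1]
    return {
        "absolute": best - second,
        "relative": (best - second) / best,
        "normalized": (best - second) / max(best, second),
    }

clear = gap_metrics([0.85, 0.45, 0.30])      # Case A: distinctive best match
ambiguous = gap_metrics([0.85, 0.82, 0.30])  # Case B: two plausible candidates
# clear["absolute"] ≈ 0.40 vs ambiguous["absolute"] ≈ 0.03,
# even though both best matches pass a 0.7 threshold
```

Since `sim_best ≥ sim_second` always holds here, the relative and normalized variants coincide; they differ only in settings where either value may be larger.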
Rigorous Alternative: Angular Distance
For mathematical rigor, use angular distance instead of raw cosine similarity:
θ = arccos(cosine_similarity) # Convert to angle in radians or degrees
Properties:
- True metric: Satisfies triangle inequality
- Proper interval scale: Equal differences = equal geometric distances on unit sphere
- Range: [0°, 180°] for general vectors; [0°, 90°] for non-negative embeddings
Example (demonstrating nonlinearity of cosine):
sim = 0.9 → θ = 25.8°
sim = 0.8 → θ = 36.9°
gap = 11.0° (geometrically meaningful)
sim = 0.3 → θ = 72.5°
sim = 0.2 → θ = 78.5°
gap = 5.9° (a smaller angular change despite the same 0.1 numerical difference!)
Gap-based metrics using angular distance:
- Absolute gap: `θ_best - θ_second` (a true metric difference)
- Relative gap: `(θ_best - θ_second) / θ_best`
- All approaches below can use θ instead of raw cosine
Trade-offs:
- Pros: Mathematically principled, proper metric, interpretable as geometry
- Cons: Slightly more computation (arccos), less familiar to practitioners, need to convert back for display
Recommendation: Use angular distance if mathematical rigor matters (academic work, methodological development). Use raw cosine for practical applications where engineering heuristics suffice.
Two-Condition Matching
Combine quality (absolute similarity) with distinctiveness (gap):
Quality: sim_best ≥ 0.7
Distinctiveness: gap ≥ 0.15 (15% improvement)
Results in 4 categories:
- High quality + high gap → "confident unique match"
- High quality + low gap → "ambiguous (multiple candidates)"
- Low quality + high gap → "clear but poor match"
- Low quality + low gap → "no good match"
This provides richer information than binary accept/reject.
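A minimal sketch of the two-condition classifier (thresholds and the relative-gap formulation are illustrative choices, per the conditions above):

```python
def categorize_match(sim_best: float, sim_second: float,
                     quality_min: float = 0.7, gap_min: float = 0.15) -> str:
    """Classify a theme's best match by quality (absolute similarity)
    and distinctiveness (relative gap to the second-best candidate)."""
    high_quality = sim_best >= quality_min
    high_gap = (sim_best - sim_second) / sim_best >= gap_min
    if high_quality and high_gap:
        return "confident unique match"
    if high_quality:
        return "ambiguous (multiple candidates)"
    if high_gap:
        return "clear but poor match"
    return "no good match"

categorize_match(0.85, 0.45)  # "confident unique match"
categorize_match(0.85, 0.82)  # "ambiguous (multiple candidates)"
```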
Aggregate Metrics
- Unique match rate: Proportion of themes with distinctive best match
- Gap distribution: Median, quartiles of gap values (not mean, per earlier discussion)
- Match confidence profile: Distribution across 4 categories above
The Within-Set Baseline Problem
Deeper question: Should match thresholds be adaptive based on intrinsic similarity structure?
Observation: Themes within set A may have natural conceptual overlap. Example:
A1: "Rural healthcare barriers"
A2: "Transportation to medical facilities"
A3: "Telemedicine adoption"
All address healthcare access → naturally ~0.65 similar to each other
Issue: If we find B1 with similarity 0.70 to A1:
- Absolute view: 0.70 passes threshold → match
- Relative view: 0.70 vs baseline 0.65 → only 0.05 improvement → weak signal
Key insight: The within-set similarity distribution provides a baseline or null model against which to judge cross-set matches.
Statistical Framing
This is analogous to:
- Permutation testing: Is observed similarity better than expected by chance?
- Signal-to-noise ratio: Signal (best match) vs noise (typical within-set similarity)
- Effect size (Cohen’s d): Mean difference relative to within-group variance
- Isotropy correction in embeddings: Adjusting for the “narrow cone” problem
Four Approaches to Baseline Calibration
Approach A: Baseline-Adjusted Similarity
Concept: Subtract theme’s typical within-set similarity from cross-set scores
adjusted_sim = sim(A_i, B_k) - mean(sim(A_i, A_j) for j≠i)
Interpretation:
- `adjusted > 0`: the B match is better than typical within-A similarity
- `adjusted > 0.2`: substantially better than baseline
Parallel: Background subtraction in signal processing, image analysis
Pros: Simple, intuitive
Cons: Doesn't account for variance in within-set similarities
Rigor note: Can apply to raw cosine (pragmatic) or angular distance (mathematically principled)
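A sketch of Approach A on the healthcare example above (array names and values are illustrative; `sim_AA` is the within-A similarity matrix, `sim_AB` the cross-set scores):

```python
import numpy as np

def baseline_adjusted(sim_AB: np.ndarray, sim_AA: np.ndarray) -> np.ndarray:
    """Subtract each A theme's mean within-set similarity (excluding
    the self-similarity diagonal) from its cross-set scores."""
    n = sim_AA.shape[0]
    off_diag = sim_AA[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    baseline = off_diag.mean(axis=1, keepdims=True)  # per-theme baseline
    return sim_AB - baseline

# Hypothetical: the three A themes are ~0.65 similar to each other
sim_AA = np.array([[1.0, 0.65, 0.65],
                   [0.65, 1.0, 0.65],
                   [0.65, 0.65, 1.0]])
sim_AB = np.array([[0.70], [0.90], [0.60]])
adj = baseline_adjusted(sim_AB, sim_AA)
# adj ≈ [0.05, 0.25, -0.05]: only the 0.90 match clears the baseline by > 0.2
```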
Approach B: Z-Score Normalization ⭐
Concept: Normalize by both mean AND standard deviation of within-set similarity
z = (sim_cross - μ_within) / σ_within
Interpretation:
- `z > 2`: the match is more than 2 standard deviations above baseline (statistically significant)
- `z > 3`: a highly significant match
- `z < 0`: the match is worse than typical within-set similarity
Parallels:
- Cohen’s d (effect size in statistics)
- Isotropy correction (Mu & Viswanath 2018; Ethayarajh 2019) for embedding anisotropy
- Standard practice in hypothesis testing
Pros:
- Standard statistical interpretation
- Accounts for both location and spread
- Handles heterogeneous within-set structure
- Puts all themes on comparable scale
Cons:
- Requires sufficient within-set samples to estimate σ
- Assumes approximately normal within-set distribution
Why recommended: Most statistically principled; widely understood by ML practitioners
Rigor note: Z-scores of raw cosine are “heuristic normalization” (nonlinear scale). For full rigor, compute z-scores on angular distances θ. Trade-off: cosine z-scores simpler and often adequate.
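Approach B can be sketched as follows (a heuristic z-score on raw cosine, per the rigor note; with only a few within-set samples the σ estimate is noisy, as the cons above warn):

```python
import numpy as np

def zscore_matches(sim_AB: np.ndarray, sim_AA: np.ndarray) -> np.ndarray:
    """Z-score cross-set similarities against each A theme's within-set
    mean and standard deviation (self-similarities excluded)."""
    n = sim_AA.shape[0]
    within = sim_AA[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    mu = within.mean(axis=1, keepdims=True)
    sigma = within.std(axis=1, ddof=1, keepdims=True)  # sample SD
    return (sim_AB - mu) / sigma

# Hypothetical within-A and cross-set similarities
sim_AA = np.array([[1.0, 0.60, 0.70],
                   [0.60, 1.0, 0.65],
                   [0.70, 0.65, 1.0]])
sim_AB = np.array([[0.85], [0.66], [0.90]])
z = zscore_matches(sim_AB, sim_AA)
# Theme 1's 0.66 match barely exceeds its baseline (z < 2),
# while themes 0 and 2 have matches well above theirs
```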
Approach C: Relative Gap Distinctiveness
Concept: Compare cross-set gap to typical within-set gaps
within_gap = max(sim(A_i, A)) - second_max(sim(A_i, A))
cross_gap = max(sim(A_i, B)) - second_max(sim(A_i, B))
relative_gap = cross_gap / within_gap
Interpretation:
- `relative_gap > 1`: the B match is MORE distinctive than A's internal structure suggests
- `relative_gap < 1`: the B match is LESS distinctive than typical within-A patterns
Parallel: Gap statistics in clustering (Tibshirani et al. 2001)
Pros:
- Captures distinctiveness structure, not just central tendency
- Adaptive to each theme’s specific pattern
Cons:
- More complex to interpret
- Undefined if within_gap = 0
Rigor note: Can apply to raw cosine or angular distance gaps
Approach D: Percentile Ranking
Concept: Rank best B match against distribution of within-A similarities
within_sims = [sim(A_i, A_j) for j≠i]
best_B = max(sim(A_i, B_k) for k)
percentile = percentile_rank(within_sims, best_B)
Interpretation:
- 95th percentile: Better than 95% of within-A similarities
- 50th percentile: Typical A-level similarity (not distinctive)
Parallel: Non-parametric hypothesis testing
Pros:
- No distributional assumptions
- Intuitive interpretation
- Robust to outliers
Cons:
- Requires sufficient within-set samples
- Less sensitive to tail behavior than z-scores
Rigor note: Works with raw cosine or angular distance; ranking preserves order either way
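Approach D reduces to a simple rank computation; a minimal sketch (the strict-inequality rank convention and example values are illustrative):

```python
def percentile_rank(within_sims: list[float], best_cross: float) -> float:
    """Percentage of within-set similarities that the best cross-set
    match strictly exceeds (non-parametric, no distributional assumptions)."""
    below = sum(s < best_cross for s in within_sims)
    return 100.0 * below / len(within_sims)

# Hypothetical within-A similarity distribution for one theme
within = [0.55, 0.60, 0.62, 0.65, 0.68, 0.70, 0.72, 0.75]

percentile_rank(within, 0.74)  # 87.5: better than 7 of 8 within-set values
percentile_rank(within, 0.64)  # 37.5: typical within-set level, not distinctive
```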
Comparison of Approaches
| Method | Accounts for Mean | Accounts for Variance | Statistical Interpretation | Computational Cost | Robustness |
|---|---|---|---|---|---|
| Baseline-adjusted | ✓ | ✗ | Simple improvement | Low | High |
| Z-score | ✓ | ✓ | Significance testing | Low | Medium |
| Relative gap | ✓ | Partial | Comparative structure | Medium | Medium |
| Percentile | ✓ | ✓ (implicit) | Non-parametric rank | Medium | High |
Broader Statistical Precedents
Isotropy Correction in Embeddings:
Modern language models produce anisotropic embeddings – vectors occupy a narrow cone in embedding space, leading to artificially high cosine similarities even for unrelated items.
Standard correction (Mu & Viswanath 2018):
corrected_sim(x, y) = raw_sim(x, y) - mean(raw_sim(x, all_items))
The within-set baseline approach is a more sophisticated variant: rather than using global mean, use context-specific baseline (within-A distribution for each theme).
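The global-mean correction above is a one-liner in practice; a sketch on hypothetical anisotropic scores (values are invented to illustrate the "everything looks similar" effect):

```python
import numpy as np

def global_mean_corrected(sim_row: np.ndarray) -> np.ndarray:
    """Global-mean correction in the spirit of Mu & Viswanath: subtract
    an item's mean similarity to all items from each pairwise score."""
    return sim_row - sim_row.mean()

# Hypothetical anisotropic similarities: every pair looks >= 0.6 similar
raw = np.array([0.62, 0.65, 0.90, 0.60, 0.63])
corrected = global_mean_corrected(raw)
# corrected ≈ [-0.06, -0.03, 0.22, -0.08, -0.05]:
# after correction, only the 0.90 match stands out above zero
```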
Permutation Testing / Null Models:
Common in network analysis and link prediction:
- Generate random baseline (permute edges, reshuffle labels)
- Compare observed statistic to null distribution
- Accept if significantly better than chance
Applied here: Within-A similarity acts as “null model” – what we’d expect for themes in the same semantic space. Cross-set matches should exceed this baseline.
Signal-to-Noise Ratio:
Used in signal processing, biosignals (EEG, fMRI), image analysis:
SNR = signal_power / noise_power
Analogously:
SNR_theme = best_match_similarity / typical_within_similarity
High SNR indicates match stands out clearly from background.
Open Questions for Expert Discussion
- Symmetry: Should we also compute within-B baselines and require that matches exceed BOTH the A and B baselines?
- Small set sizes: How do we handle |A| < 5? There are too few samples for meaningful σ or percentile estimates.
- Integration with Hungarian matching: Could baseline-adjusted scores replace raw similarities in the Hungarian algorithm?
- Threshold selection: Can we replace fixed thresholds (0.7) with adaptive thresholds based on within-set statistics?
- Multiple testing correction: If testing many themes, should we apply Bonferroni or FDR correction to z-scores?
- Anisotropy diagnosis: Can we quantify whether within-set high similarity is due to:
- Genuine semantic overlap (expected)
- Embedding space anisotropy (technical artifact)
- Stability: How sensitive are these methods to outliers in the within-set distribution?
- Interpretability trade-offs: Z-scores have a standard interpretation but may be less intuitive to non-statisticians than simple gaps.
- Rigor vs practicality: Should we use angular distance (mathematically rigorous) or raw cosine (simpler, more familiar)? Does the added mathematical rigor matter for qualitative research applications, or are engineering heuristics adequate?
Potential Applications
Adaptive Quality Control:
- Flag themes that don’t rise above their within-set baseline
- “A1 matched B3, but B3 is only 0.05 better than typical A similarities → review”
Domain-Aware Thresholding:
- High within-set similarity (specialized domain) → require larger gaps
- Low within-set similarity (diverse themes) → accept smaller gaps
- Automatically adapts to semantic structure
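One plausible construction of such a data-driven threshold is mean-plus-k-sigma over the within-set similarities; this is an illustrative sketch, not a validated rule, and `k` is a tuning knob:

```python
import numpy as np

def adaptive_threshold(within_sims: np.ndarray, k: float = 2.0) -> float:
    """Data-driven acceptance threshold: mean within-set similarity plus k
    sample standard deviations. A tight, specialised theme set yields a high
    bar; a diverse set yields a lower one."""
    return float(within_sims.mean() + k * within_sims.std(ddof=1))
```

A cross-set match would then be accepted only if it exceeds this set-specific bar rather than a fixed 0.7.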
Confidence-Weighted Metrics:
- Weight matches by z-score or gap when computing aggregate statistics
- High-confidence matches count more than ambiguous matches
Handling Embedding Anisotropy:
- Directly addresses the “narrow cone” problem in modern embeddings
- Makes comparisons more robust to technical artifacts
Methodological Comparison:
- Compare raw vs baseline-adjusted approaches on same data
- Quantify how much baseline correction changes match decisions
Relationship to Existing Metrics
Precision/Recall/F1: Binary (matched or not)
- Baseline correction could inform confidence weights
- E.g., weighted F1 where high-z matches count more
Paper’s Jaccard: Density of matches
- Could threshold using z-scores instead of raw similarities
- “Jaccard with z > 2” vs “Jaccard with raw > 0.7”
Hungarian matching: Optimal one-to-one assignment
- Could operate on baseline-adjusted or z-score matrix instead of raw similarities
- May find different optimal matching
Bidirectional max-similarity: Averages raw scores
- Could be replaced with gap-based or z-score metrics
- These have more principled measurement properties
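To illustrate the Hungarian-on-z-scores idea: for the small theme sets typical here, the optimal one-to-one assignment can be found by brute force over permutations (for larger sets, `scipy.optimize.linear_sum_assignment` solves the same problem efficiently). The function name and matrix orientation are assumptions for this sketch:

```python
import itertools
import numpy as np

def best_assignment_on_zscores(z: np.ndarray) -> list:
    """Optimal one-to-one assignment maximising total z-score, by exhaustive
    search over column permutations. Rows are themes in A, columns themes
    in B; assumes len(A) <= len(B). Returns (row, column) index pairs."""
    n, m = z.shape
    assert n <= m, "expects rows <= columns"
    best_cols, best_total = None, -np.inf
    for cols in itertools.permutations(range(m), n):
        total = z[np.arange(n), list(cols)].sum()
        if total > best_total:
            best_total, best_cols = total, cols
    return list(zip(range(n), best_cols))
```

Running this on a baseline-adjusted (z-score) matrix instead of raw similarities may indeed select a different matching, since row-wise centring changes the relative cost structure.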
Summary for Discussion
Two key ideas emerge:
- Gap-based distinctiveness: Measure confidence in a match via the gap to the second-best candidate
  - Mathematically valid (differences are meaningful for interval scales)
  - Strong precedents (Lowe’s ratio test, silhouette scores, classification margins)
  - Distinguishes confident from ambiguous matches
- Baseline calibration: Adjust for within-set similarity structure
  - Accounts for intrinsic semantic overlap
  - Analogous to isotropy correction, signal-to-noise ratios, and effect sizes
  - Z-score approach recommended for its statistical properties
Both approaches address limitations of current metrics (averaging raw similarities, fixed thresholds) while connecting to established statistical and ML practices.
Next steps: Discuss with domain experts to determine:
- Which approach(es) best match conceptual model of theme matching
- Practical threshold values (e.g., z > 2 vs z > 1.5)
- Whether benefits justify added complexity
- Suitability for reporting to qualitative researchers
References
Raza, M. Z., Xu, J., Lim, T., Boddy, L., Mery, C. M., Well, A., & Ding, Y. (2025). “LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease.” arXiv:2502.01620. https://arxiv.org/abs/2502.01620
Schubert, E. (2021). “A Triangle Inequality for Cosine Similarity.” In Similarity Search and Applications: 14th International Conference, SISAP 2021. arXiv:2107.04071. https://arxiv.org/abs/2107.04071
Steck, H., Ekanadham, C., & Kallus, N. (2024). “Is Cosine-Similarity of Embeddings Really About Similarity?” arXiv:2403.05440. https://arxiv.org/abs/2403.05440
Stevens, S. S. (1946). “On the theory of scales of measurement.” Science, 103(2684), 677-680.
Kuhn, H. W. (1955). “The Hungarian Method for the Assignment Problem.” Naval Research Logistics Quarterly, 2(1-2), 83-97.
Lowe, D. G. (2004). “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision, 60(2), 91-110.
Mu, J., & Viswanath, P. (2018). “All-but-the-Top: Simple and Effective Postprocessing for Word Representations.” In Proceedings of ICLR 2018.
Ethayarajh, K. (2019). “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.” In Proceedings of EMNLP 2019.