# Ground Truth Validation

Validate your LLM classifications against ground truth labels using the Classifier node's built-in metrics.
## Overview

The Classifier node can automatically:
- Compare LLM predictions to ground truth labels from your data
- Calculate precision, recall, F1 scores (macro, micro, weighted, binary)
- Generate confusion matrices
- Support multiple models with inter-rater agreement
## Quick Start

```yaml
nodes:
  - name: classify_utterances
    type: Classifier
    model_names: [gpt-4o-mini]
    inputs: [documents]
    ground_truths:
      reflection:                     # LLM output field name
        existing: reflection_exists   # Ground truth column name
        mapping:
          yes: 1                      # Map LLM outputs to GT values
          no: 0
          unclear: 0
```
## Configuration

### Basic Structure

```yaml
ground_truths:
  <llm_field_name>:               # Must match LLM output field
    existing: <gt_column>         # Ground truth column in your data
    mapping:                      # How to map LLM outputs
      <llm_value>: <gt_value>
    drop: [<values_to_exclude>]   # Optional: exclude from metrics
```
### Field Names

**Critical:** the ground truth config key must match the LLM output field name.

```yaml
# Template outputs: [[pick:utterance_type|...]]
ground_truths:
  utterance_type:                 # ✓ Matches output field
    existing: question_exists
  # utterance_classification:     # ✗ Wrong - doesn't match
  #   existing: question_exists
```

Error if mismatch:

```
Field 'utterance_classification' not found in model outputs.
Available output fields: ['utterance_type', 'reflection']
```
## Mapping Strategies

### 1. Binary Classification

Map multiple LLM categories to a binary ground truth:

```yaml
utterance_type:
  existing: question_exists   # Values: 1.0, 0.0, NaN
  mapping:
    open_question: 1          # Questions → 1
    closed_question: 1
    statement: 0              # Non-questions → 0
    advice_suggestion: 0
    other: 0
```

Result: 2-class confusion matrix (0, 1)
### 2. Multi-class with Wildcards

Collapse unmapped categories using `"*"`:

```yaml
utterance_type:
  existing: question_subtype   # Values: open, closed, facilitating, NaN
  mapping:
    open_question: open
    closed_question: closed
    "*": "*"                   # All others → "other" category
```

Result: 3-class confusion matrix (closed, open, other)
### 3. Exclude Categories

Drop predictions you don't want to validate:

```yaml
utterance_type:
  existing: question_exists
  mapping:
    open_question: 1
    closed_question: 1
    statement: ""          # Map to empty string
    advice_suggestion: ""
    other: ""
  drop: [""]               # Exclude empty string
```

Result: only open_question and closed_question are validated

Alternative with a wildcard:

```yaml
mapping:
  open_question: 1
  closed_question: 1
  "*": ""      # All unmapped → ""
drop: [""]     # Then drop
```
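The mapping, wildcard, and drop semantics above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the node's actual implementation; `apply_mapping` is a hypothetical helper name:

```python
# Illustrative sketch (not the node's real code): apply a mapping with
# wildcard ("*") and drop semantics to a list of raw LLM predictions.
def apply_mapping(predictions, mapping, drop=()):
    mapped = []
    for value in predictions:
        if value in mapping:
            target = mapping[value]
        elif "*" in mapping:
            # "*": "*" collapses unmapped values into a literal "other" bucket;
            # any other wildcard target is used as-is.
            target = "other" if mapping["*"] == "*" else mapping["*"]
        else:
            raise ValueError(f"Unmapped prediction value: {value!r}")
        if target not in drop:   # drop: exclude these values from metrics entirely
            mapped.append(target)
    return mapped

preds = ["open_question", "closed_question", "statement", "other"]
print(apply_mapping(preds, {"open_question": 1, "closed_question": 1, "*": ""}, drop=[""]))
# keeps only the two question categories: [1, 1]
```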
## Boolean Handling

YAML 1.1 parses unquoted yes/no as booleans. The node handles this automatically:

```yaml
mapping:
  yes: 1   # YAML parses this as {True: 1, False: 0}
  no: 0    # but it still matches the LLM outputs "yes"/"no"
```

Supported variations: yes/Yes/YES, no/No/NO, true/True/TRUE, false/False/FALSE

To be explicit, use quotes:

```yaml
mapping:
  "yes": 1   # Stays as a string key
  "no": 0
```
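The automatic handling described above amounts to converting boolean keys back to the strings the LLM actually emits. A minimal sketch of that normalization (the actual implementation may differ; `normalize_bool_keys` is a hypothetical name):

```python
# YAML 1.1 loaders turn unquoted yes/no into True/False dict keys, so a
# mapping like {yes: 1, no: 0} loads as {True: 1, False: 0}. This sketch
# converts the boolean keys back to the "yes"/"no" strings LLMs output.
def normalize_bool_keys(mapping):
    fixed = {}
    for key, value in mapping.items():
        if key is True:
            fixed["yes"] = value
        elif key is False:
            fixed["no"] = value
        else:
            fixed[key] = value
    return fixed

print(normalize_bool_keys({True: 1, False: 0, "unclear": 0}))
# {'yes': 1, 'no': 0, 'unclear': 0}
```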
## Missing Values

Missing values (NaN, None) are handled as an "NA" category:

```yaml
# Ground truth: [1, 0, NaN, 1]
# Becomes:      [1, 0, NA, 1]
```

To exclude missing values:

```yaml
drop: ["NA"]
```

To collapse them into "other":

```yaml
mapping:
  yes: 1
  no: 0
  "*": "*"   # NaN → "other"
```
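The NaN-to-"NA" rule above can be sketched in pure Python (illustrative only; `to_na` is a hypothetical helper):

```python
# Replace missing values (None, NaN) in a ground-truth column with the
# literal "NA" category before any mapping or drop rules are applied.
def to_na(values):
    out = []
    for v in values:
        if v is None or (isinstance(v, float) and v != v):  # NaN != NaN
            out.append("NA")
        else:
            out.append(v)
    return out

print(to_na([1, 0, float("nan"), 1]))   # [1, 0, 'NA', 1]
```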
## Output Files

### Metrics CSV

`ground_truth_metrics.csv`:

```csv
field,model,ground_truth_column,mapping,n_samples,precision_macro,recall_macro,f1_macro,...
reflection,gpt-4o-mini,reflection_exists,"{True: 1, False: 0}",50,0.85,0.82,0.83,...
```
### Confusion Matrices

`confusion_matrix_<field>_<model>.csv`:

```csv
# Confusion Matrix: reflection
# Model: gpt-4o-mini
# Sample size: 50
# Ground truth column: reflection_exists
# Mapping applied:
#   yes -> 1
#   no -> 0
#
True \ Predicted,0,1
0,38,2
1,3,7
```
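Since the metadata lines start with `#`, the matrix can be read back with the standard library by skipping comment lines. A minimal sketch (assuming the layout shown above; `read_confusion_matrix` is a hypothetical helper):

```python
import csv

# Parse a confusion-matrix CSV of the form shown above: "#" lines are
# metadata comments, the remainder is a header row plus one row per
# true label, giving counts per predicted label.
def read_confusion_matrix(text):
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    reader = csv.reader(rows)
    predicted_labels = next(reader)[1:]   # skip the "True \ Predicted" cell
    return {
        row[0]: dict(zip(predicted_labels, map(int, row[1:])))
        for row in reader
    }

text = "True \\ Predicted,0,1\n0,38,2\n1,3,7\n"
matrix = read_confusion_matrix(text)
print(matrix["0"]["1"])   # 2 true 0s were predicted as 1
```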
### Detailed JSON

`ground_truth_metrics.json` includes:
- Full confusion matrices
- Per-class precision/recall/F1
- Classification reports
- Configuration metadata
## Metrics Explained

### Macro F1

- Average F1 across all classes
- Equal weight to each class
- Use when: rare classes are as important as common ones

### Micro F1

- Global average across all predictions
- Weighted by frequency (dominated by common classes)
- Equals overall accuracy
- Use when: overall correctness matters most

### Weighted F1

- Average F1 weighted by class support
- A balance between macro and micro
- Use when: you need a default choice for reporting

### Binary F1

- F1 for the positive class only (2-class problems)
- Uses the highest-sorted label as positive (e.g., "1" in ["0", "1"])
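The macro/micro distinction can be made concrete with a small pure-Python sketch on an imbalanced two-class example (illustrative only; the node computes these internally):

```python
# Toy computation of macro vs micro F1 on an imbalanced 2-class problem.
def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    macro = sum(per_class) / len(per_class)                            # equal weight per class
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # = accuracy
    return macro, micro

# 8 positives, 2 negatives; one negative is misclassified.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
macro, micro = f1_scores(y_true, y_pred)
# micro (= accuracy) is 0.9, but macro is lower (~0.80) because the
# single error hits the rare class hard (its F1 is only 2/3).
```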
## Common Patterns

### Binary with Multiple Predictors

```yaml
model_names: [gpt-4o-mini, gpt-4o, claude-3-5-sonnet]
ground_truths:
  reflection:
    existing: reflection_exists
    mapping: {yes: 1, no: 0, unclear: 0}
```

Output: metrics for each model plus inter-rater agreement
### Multi-class Collapsed

```yaml
ground_truths:
  question_type:
    existing: question_subtype
    mapping:
      open_question: open
      closed_question: closed
      "*": other   # statement/advice/etc → "other"
```
### Exclude Non-applicable

```yaml
ground_truths:
  client_utterance:
    existing: client_talk_type
    mapping:
      positive: positive
      negative: negative
      neutral: neutral
      "*": ""
    drop: ["", "NA"]   # Exclude therapist utterances
```
## Error Messages

### Unmapped Predictions

```
Unmapped prediction value(s) found in field 'reflection':
  Values: ['unclear']

Fix option 1 - Map to target values:
  "unclear": 0

Option 2 - Drop/exclude from metrics:
  mapping:
    "unclear": ""
  drop: [""]

Option 3 - Use wildcard:
  "*": other
```
### Unpredicted Ground Truth

```
WARNING: Ground truth categories never predicted in 'utterance_type/gpt-4o-mini':
  Categories: ['facilitating', 'NA']
These will appear in confusion matrix with zero prediction counts.
To collapse these to 'other' category, add wildcard to mapping:
  "*": "*" or "*": other
```

This warning is informational: metrics are still computed, but zero-count categories appear in the confusion matrix.
### Duplicate Keys

YAML doesn't allow duplicate keys; only the last one is kept:

```yaml
ground_truths:
  utterance_type:        # First definition
    existing: question_exists
    mapping: {...}
  utterance_type:        # ✗ Overwrites the first!
    existing: question_subtype
    mapping: {...}
```

Fix: use unique names:

```yaml
ground_truths:
  utterance_type_binary:
    existing: question_exists
  utterance_type_subtype:
    existing: question_subtype
```
## Best Practices

- Match field names – the ground truth key must match the LLM output field
- Use wildcards – avoid explicitly mapping every category
- Drop vs map – drop unwanted predictions, don't force-map them
- Quote booleans – use `"yes"` and `"no"` if you want explicit control
- Test incrementally – start with one field, add more once it works
- Check the confusion matrix – verify categories make sense before trusting metrics
- Multiple models – compare different models on the same ground truth
## Full Example

```yaml
name: mi_classification
nodes:
  - name: classify
    type: Classifier
    model_names:
      - gpt-4o-mini
      - gpt-4o
    inputs: [documents]
    ground_truths:
      # Binary: is it a question?
      utterance_type:
        existing: question_exists
        mapping:
          open_question: 1
          closed_question: 1
          "*": 0
      # Multi-class: question subtype
      question_subtype:
        existing: question_subtype_gt
        mapping:
          open_question: open
          closed_question: closed
          "*": "*"
        drop: ["NA"]
      # Binary: reflection present?
      reflection:
        existing: reflection_exists
        mapping:
          "yes": 1
          "no": 0
          "unclear": 0

---#classify
[[pick:utterance_type|open_question,closed_question,statement,other]]
[[pick:question_subtype|open_question,closed_question]]
[[pick:reflection|yes,no,unclear]]
```