# Working with Spreadsheet Data (CSV/XLSX)
soak can process CSV and Excel files directly, treating each row as a separate document. This is useful for survey data, coded transcripts, or any tabular data you want to analyze with LLMs.
## Quick Start
```bash
# Process CSV file
soak run classifier_tabular data/responses.csv -o results

# Process Excel file
soak run my_pipeline data/survey.xlsx -o analysis
```
## How It Works
When you provide a CSV or XLSX file as input:
- **Each row becomes a document** – soak creates one `TrackedItem` per row
- **Columns become template variables** – all column values are accessible in your pipeline templates as `{{ column_name }}`
- **Content is in metadata** – unlike text files, the row data is stored in metadata, not in `content`
- **Provenance is tracked** – each row gets a unique `source_id` like `filename__row_0`, `filename__row_1`, etc.
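The row-to-document mapping above can be sketched with pandas (illustrative only — `TrackedItem` is soak's own type; a plain dict stands in for it here):

```python
import io

import pandas as pd

# Illustrative sketch, not soak's actual implementation: turn each CSV row
# into a document-like dict with a stable, provenance-tracking source_id.
csv_text = """participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
"""

df = pd.read_csv(io.StringIO(csv_text))
stem = "responses"  # filename without extension

documents = [
    {
        "source_id": f"{stem}__row_{i}",  # one id per row: responses__row_0, ...
        "metadata": row.to_dict(),        # columns live in metadata, not content
    }
    for i, row in df.iterrows()
]

print(documents[0]["source_id"])        # responses__row_0
print(documents[1]["metadata"]["age"])  # 32
```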
## Accessing Column Data in Templates
### Example Data (responses.csv)

```csv
participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
P003,28,control,No significant changes noticed
P004,45,treatment,Remarkable improvement in my daily routine
```
### Pipeline Template

```yaml
name: analyze_responses
nodes:
  - name: classify
    type: Map
    inputs: [documents]

---#classify

Analyze this participant response:

**Participant:** {{ participant_id }}
**Age:** {{ age }}
**Condition:** {{ condition }}
**Response:** {{ response }}

Based on the response, classify the sentiment and extract key themes.

[[classification]]
```
All column names (`participant_id`, `age`, `condition`, `response`) are automatically available as template variables.
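The substitution itself can be sketched in a few lines of Python (soak's real template engine is more capable; this only illustrates how `{{ column_name }}` placeholders are filled from a row's values):

```python
import re

# Minimal sketch of {{ column_name }} substitution: each placeholder is
# replaced with the matching column value from the row dict.
def render(template: str, row: dict) -> str:
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(row[m.group(1)]),
        template,
    )

row = {"participant_id": "P001", "age": 25, "condition": "control"}
rendered = render(
    "**Participant:** {{ participant_id }} ({{ age }}, {{ condition }})", row
)
print(rendered)  # **Participant:** P001 (25, control)
```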
## Sampling and Filtering

### Take First N Rows

```bash
# Process first 10 rows only
soak run my_pipeline data/survey.csv --head 10 -o test_run
```

### Random Sample

```bash
# Randomly sample 50 rows
soak run my_pipeline data/large_survey.csv --sample 50 -o pilot_analysis
```
This is useful for:
- Testing pipelines on large datasets
- Pilot studies
- Cost estimation before full runs
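In pandas terms, the two flags correspond to `head` and `sample` (a conceptual sketch, not soak's implementation):

```python
import io

import pandas as pd

# What --head and --sample do conceptually, expressed with pandas.
csv_text = "\n".join(["id,score"] + [f"P{i:03d},{i % 5}" for i in range(100)])
df = pd.read_csv(io.StringIO(csv_text))

head_10 = df.head(10)                        # --head 10: the first N rows
sample_50 = df.sample(n=50, random_state=0)  # --sample 50: N random rows

print(len(head_10), len(sample_50))  # 10 50
```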
## Ground Truth Validation
When your CSV contains ground truth labels, you can automatically validate LLM predictions:
### Example Data (coded_data.csv)

```csv
text,sentiment_actual,topic_actual
Great product!,positive,product
Terrible service,negative,service
```
### Pipeline with Ground Truth

```yaml
name: validate_classifier
nodes:
  - name: classify
    type: Classifier
    inputs: [documents]
    ground_truth:
      sentiment:
        existing: sentiment_actual  # Compare to this column
        mapping: null               # Auto-detect mapping
      topic:
        existing: topic_actual

---#classify

Text: {{ text }}

Classify the sentiment (positive/negative/neutral) and topic: [[classification]]
```
This will automatically:
- Compare LLM predictions to ground truth labels
- Calculate precision, recall, F1 scores
- Generate confusion matrices
- Export results to CSV with accuracy metrics
See Ground Truth Validation for details.
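The comparison these metrics rest on can be shown with a small pure-Python sketch of per-label precision, recall, and F1 (soak computes these for you; this only illustrates the arithmetic):

```python
# Per-label precision/recall/F1 from predicted vs. actual labels.
def prf(actual, predicted, label):
    tp = sum(a == label and p == label for a, p in zip(actual, predicted))
    fp = sum(a != label and p == label for a, p in zip(actual, predicted))
    fn = sum(a == label and p != label for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual    = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

# One true positive, one missed positive: precision 1.0, recall 0.5.
print(prf(actual, predicted, "positive"))
```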
## Multi-Column Access
You can access multiple columns in a single template:
```yaml
---#analyze

**Demographics:**
- ID: {{ participant_id }}
- Age: {{ age }}
- Gender: {{ gender }}
- Education: {{ education }}

**Study Data:**
- Condition: {{ condition }}
- Session: {{ session }}
- Response: {{ response }}

Analyze the response considering the participant's demographics: [[analysis]]
```
## Export Preserves Metadata
When you export classifier results, the original CSV columns are preserved:
```bash
soak run classifier_tabular data/responses.csv -o results --dump
```

Output CSV (`results_dump/classify/classifications.csv`) includes:

```csv
index,source_id,participant_id,age,condition,response,sentiment,topic
0,responses__row_0,P001,25,control,I felt very relaxed...,positive,relaxation
1,responses__row_1,P002,32,treatment,The intervention...,positive,focus
```
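The shape of that output can be sketched with pandas: the original columns are concatenated side by side with the new prediction columns (illustrative of the result shape, not soak's export code; the `sentiment`/`topic` values are made up to match the example):

```python
import io

import pandas as pd

# Original rows plus per-row predictions, joined column-wise.
rows = pd.read_csv(io.StringIO(
    "participant_id,age,condition,response\n"
    "P001,25,control,I felt very relaxed\n"
    "P002,32,treatment,The intervention helped\n"
))
predictions = pd.DataFrame({
    "sentiment": ["positive", "positive"],
    "topic": ["relaxation", "focus"],
})

out = pd.concat([rows, predictions], axis=1)
print(list(out.columns))
# ['participant_id', 'age', 'condition', 'response', 'sentiment', 'topic']
```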
## Supported Formats

- **CSV** (`.csv`) – via pandas `read_csv()`
- **Excel** (`.xlsx`) – via pandas `read_excel()` with openpyxl
## Common Use Cases

### Survey Analysis

```bash
soak run classifier_tabular survey_responses.csv -o survey_analysis
```
### Coded Transcripts

If you have pre-coded interview data in a spreadsheet:

```yaml
# Each row is a coded segment
nodes:
  - name: analyze_codes
    type: Map
    inputs: [documents]

---#analyze_codes

**Segment:** {{ segment_id }}
**Speaker:** {{ speaker }}
**Existing Code:** {{ existing_code }}
**Text:** {{ text }}

Compare the manual code with the text and suggest refinements: [[analysis]]
```
### Longitudinal Data

For repeated measures:

```yaml
---#analyze_change

**Participant:** {{ participant_id }}
**Timepoint:** {{ timepoint }}
**Measure:** {{ measure }}
**Notes:** {{ notes }}

Analyze change over time: [[analysis]]
```
## Example: End-to-End Analysis

```bash
# 1. Test on a small sample
soak run classifier_tabular survey.csv --head 20 -o test

# 2. Review test results
open test_pipeline.html

# 3. Run on the full dataset
soak run classifier_tabular survey.csv -o full_analysis -v

# 4. Check CSV output (results are in the dump folder)
open full_analysis_dump/01_Classifier_classify/classifications.csv
```