Working with Spreadsheet Data (CSV/XLSX)

soak can process CSV and Excel files directly, treating each row as a separate document. This is useful for survey data, coded transcripts, or any tabular data you want to analyze with LLMs.

Quick Start

# Process CSV file
soak run classifier_tabular data/responses.csv -o results

# Process Excel file
soak run my_pipeline data/survey.xlsx -o analysis

How It Works

When you provide a CSV or XLSX file as input:

  1. Each row becomes a document – soak creates one TrackedItem per row
  2. Columns become template variables – All column values are accessible in your pipeline templates as `{{column_name}}`
  3. Content is in metadata – Unlike text files, the row data is stored in metadata, not in content
  4. Provenance is tracked – Each row gets a unique source_id like filename__row_0, filename__row_1, etc.
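The steps above can be sketched with pandas; this is a simplified illustration of the loading behavior, not soak's actual implementation (the real `TrackedItem` class and its field names may differ):

```python
from pathlib import Path

import pandas as pd


def rows_to_documents(path: str) -> list[dict]:
    """Turn each spreadsheet row into a document-like dict.

    Mirrors the behavior described above: column values go into
    metadata, and each row gets a provenance source_id.
    """
    df = pd.read_csv(path)
    stem = Path(path).stem
    return [
        {
            "source_id": f"{stem}__row_{i}",  # e.g. responses__row_0
            "metadata": row.to_dict(),        # columns become template variables
            "content": "",                    # row data lives in metadata, not content
        }
        for i, row in df.iterrows()
    ]
```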

Accessing Column Data in Templates

Example Data (responses.csv)

participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
P003,28,control,No significant changes noticed
P004,45,treatment,Remarkable improvement in my daily routine

Pipeline Template

name: analyze_responses

nodes:
  - name: classify
    type: Map
    inputs: [documents]

---#classify

Analyze this participant response:

**Participant:** {{participant_id}}
**Age:** {{age}}
**Condition:** {{condition}}
**Response:** {{response}}

Based on the response, classify the sentiment and extract key themes.

[[classification]]

All column names (participant_id, age, condition, response) are automatically available as template variables.
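Conceptually, the substitution works like a `{{ variable }}` template engine fed with the row's column values. The sketch below uses a minimal regex-based stand-in for the real engine (soak's actual templating machinery is not shown here):

```python
import re


def render(template: str, row: dict) -> str:
    """Minimal stand-in for a {{ variable }} template engine."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(row[m.group(1)]),
                  template)


# One row from responses.csv above
row = {
    "participant_id": "P001",
    "age": 25,
    "condition": "control",
    "response": "I felt very relaxed during the session",
}

prompt = render("**Participant:** {{participant_id}}\n"
                "**Condition:** {{condition}}", row)
```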

Sampling and Filtering

Take First N Rows

# Process first 10 rows only
soak run my_pipeline data/survey.csv --head 10 -o test_run

Random Sample

# Randomly sample 50 rows
soak run my_pipeline data/large_survey.csv --sample 50 -o pilot_analysis

This is useful for:

  • Testing pipelines on large datasets
  • Pilot studies
  • Cost estimation before full runs
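In pandas terms, `--head` and `--sample` correspond to the following (assuming soak samples without replacement; seed handling is an implementation detail):

```python
import pandas as pd

# Stand-in for pd.read_csv("data/large_survey.csv")
df = pd.DataFrame({"response": [f"answer {i}" for i in range(100)]})

first_ten = df.head(10)                   # --head 10: first N rows in file order
pilot = df.sample(n=50, random_state=42)  # --sample 50: random rows, seeded for reproducibility
```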

Ground Truth Validation

When your CSV contains ground truth labels, you can automatically validate LLM predictions:

Example Data (coded_data.csv)

text,sentiment_actual,topic_actual
Great product!,positive,product
Terrible service,negative,service

Pipeline with Ground Truth

name: validate_classifier

nodes:
  - name: classify
    type: Classifier
    inputs: [documents]
    ground_truth:
      sentiment:
        existing: sentiment_actual  # Compare to this column
        mapping: null  # Auto-detect mapping
      topic:
        existing: topic_actual

---#classify

Text: {{text}}

Classify the sentiment (positive/negative/neutral) and topic: [[classification]]

This will automatically:

  • Compare LLM predictions to ground truth labels
  • Calculate precision, recall, F1 scores
  • Generate confusion matrices
  • Export results to CSV with accuracy metrics

See Ground Truth Validation for details.

Multi-Column Access

You can access multiple columns in a single template:

---#analyze

**Demographics:**
- ID: {{participant_id}}
- Age: {{age}}
- Gender: {{gender}}
- Education: {{education}}

**Study Data:**
- Condition: {{condition}}
- Session: {{session}}
- Response: {{response}}

Analyze the response considering the participant's demographics: [[analysis]]

Export Preserves Metadata

When you export classifier results, the original CSV columns are preserved:

soak run classifier_tabular data/responses.csv -o results --dump

Output CSV (results_dump/classify/classifications.csv) includes:

index,source_id,participant_id,age,condition,response,sentiment,topic
0,responses__row_0,P001,25,control,I felt very relaxed...,positive,relaxation
1,responses__row_1,P002,32,treatment,The intervention...,positive,focus
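Because the original columns survive the export, downstream analysis needs no joins. For example, cross-tabulating predicted sentiment by experimental condition (the DataFrame below stands in for reading the exported CSV):

```python
import pandas as pd

# In practice: results = pd.read_csv(".../classifications.csv")
results = pd.DataFrame({
    "condition": ["control", "treatment", "control", "treatment"],
    "sentiment": ["positive", "positive", "neutral", "positive"],
})

# Counts of each predicted sentiment within each condition
table = pd.crosstab(results["condition"], results["sentiment"])
print(table)
```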

Supported Formats

  • CSV (.csv) – via pandas read_csv()
  • Excel (.xlsx) – via pandas read_excel() with openpyxl

Common Use Cases

Survey Analysis

soak run classifier_tabular survey_responses.csv -o survey_analysis

Coded Transcripts

If you have pre-coded interview data in a spreadsheet:

# Each row is a coded segment
nodes:
  - name: analyze_codes
    type: Map
    inputs: [documents]

---#analyze_codes

**Segment:** {{segment_id}}
**Speaker:** {{speaker}}
**Existing Code:** {{code}}
**Text:** {{text}}

Compare the manual code with the text and suggest refinements: [[analysis]]

Longitudinal Data

For repeated measures:

---#analyze_change

**Participant:** {{participant_id}}
**Timepoint:** {{timepoint}}
**Measure:** {{measure}}
**Notes:** {{notes}}

Analyze change over time: [[analysis]]

Example: End-to-End Analysis

# 1. Test on small sample
soak run classifier_tabular survey.csv --head 20 -o test

# 2. Review test results
open test_pipeline.html

# 3. Run on full dataset
soak run classifier_tabular survey.csv -o full_analysis -v

# 4. Check CSV output (results are in the dump folder)
open full_analysis_dump/01_Classifier_classify/classifications.csv
