Customizing Your Analysis

This tutorial shows how to adapt soak pipelines to your research needs by modifying prompts and pipeline structure.

Quick Customization: Context Variables

The fastest way to customize is using context variables with -c:

uv run soak zs data/*.txt \
  --output results \
  -c research_question="What factors influence treatment adherence?" \
  -c persona="Health psychologist specializing in chronic illness"

Context variables inject into templates via ``.

Available Variables (zs/zspe)

Check pipeline defaults:

uv run soak show pipeline zs | grep -A 5 "default_context"

Common variables:

persona - Who the LLM should act as (default: “Experienced qual researcher”)
research_question - Your specific research question (default: None)
excerpt_topics - Topics to extract (zspe only)

Deep Customization: Editing Pipelines

For more control, copy and modify pipeline files.

Step 1: Get the Pipeline

uv run soak show pipeline zs > my_analysis.soak

Step 2: Edit the YAML

Open my_analysis.soak in your editor. The file has two sections:

Front matter (YAML):

name: zero_shot
default_context:
  persona: Experienced qual researcher
  research_question: None

nodes:
  - name: chunks
    type: Split
    chunk_size: 30000
  # ... more nodes

Templates (Jinja2 + struckdown):

---#chunk_codes_and_themes

You are a: 

Identify all relevant codes in the text...

[[code*:codes]]

Step 3: Modify for Your Domain

Example: Adapting for education research

name: education_analysis
default_context:
  persona: Education researcher studying student engagement
  research_question: How do students experience online learning?

nodes:
  - name: chunks
    type: Split
    chunk_size: 20000  # Smaller chunks for detailed coding

  - name: chunk_codes_and_themes
    type: Map
    max_tokens: 16000
    inputs:
      - chunks

Then edit template:

---#chunk_codes_and_themes

You are a: 

Research question: 

Code this student interview transcript. Focus on:
- Learning experiences (positive and negative)
- Technology use and challenges
- Social interaction and isolation
- Motivation and engagement

A 'code' should capture specific aspects of the student experience.

Identify all codes, with:
- Name (8-15 words)
- Description (50 words)
- Direct quotes from the student

<text>

</text>

[[code*:codes]]

Now identify themes that group related codes...

[[theme*:themes]]

Step 4: Run Your Custom Pipeline

uv run soak my_analysis.soak data/student_interviews/*.txt --output results

Common Customizations

Change Code/Theme Criteria

Original (zs.soak):

A 'code' should be related to the desires, needs, and meaningful outcomes
for participants.

Modified for behavior analysis:

A 'code' should identify specific behaviors, actions, or practices mentioned
by participants. Focus on what people do, not just what they feel.

Adjust Number of Themes

Original:

Review and consolidate into ~7 (+/- 2) overarching major themes.

Modified:

Review and consolidate into ~4 major themes. Use fewer themes to capture
only the most prominent patterns.

Change Quote Requirements

Original:

Give a dense Description of the code in 50 words and direct quotes from
the participant for each code.

Modified:

Give a dense Description of the code in 50 words and 2-3 SHORT direct quotes
(max 2 sentences each) from the participant for each code.

Add Custom Instructions

Insert domain-specific guidance:

---#chunk_codes_and_themes

You are analyzing clinical interviews about treatment experiences.

IMPORTANT CONTEXT:
- Participants have chronic fatigue syndrome (CFS/ME)
- Many tried multiple treatments before finding help
- Recovery is often partial, not complete
- Medical dismissal is a common theme

When coding:
- Distinguish between complete/partial/no recovery
- Note treatments tried (medical, alternative, self-directed)
- Flag experiences of medical gaslighting or dismissal



[[code*:codes]]

Working with Return Types

soak uses struckdown syntax for structured outputs: [[return_type:field_name]]

Available Return Types

Thematic analysis:

[[code*:codes]] - List of Code objects
[[theme*:themes]] - List of Theme objects
[[extract:text]] - Free-form text extraction
[[report]] - Free-form narrative

Classification (see classifier.soak):

[[pick:field|option1,option2]] - Single choice
[[pick*:field|option1,option2]] - Multiple choice
[[int:field]] - Integer
[[boolean:field]] - Yes/no
[[text:field]] - Free text

Example: Custom Structured Output

---#assessment

Read this clinical note and extract structured information:



Patient diagnosis:
[[pick:diagnosis|cfs,me,both,unclear]]

Severity level:
[[pick:severity|mild,moderate,severe,very_severe]]

Primary symptoms (select all that apply):
[[pick*:symptoms|fatigue,pain,cognitive_issues,sleep_problems,pem]]

Duration of illness in years:
[[int:years_ill]]

Currently employed:
[[boolean:employed]]

Clinical notes:
[[text:notes]]

This creates a dictionary with typed fields.

Pipeline Structure Changes

Add a New Node

Insert a filtering step:

nodes:
  - name: chunks
    type: Split
    chunk_size: 30000

  - name: filter_relevant      # NEW NODE
    type: Map
    inputs:
      - chunks

  - name: chunk_codes_and_themes
    type: Map
    inputs:
      - filter_relevant        # Changed from 'chunks'

Then add template:

---#filter_relevant

Remove any text that is:
- Interviewer speech (unless needed for context)
- Off-topic small talk
- Administrative content

Keep only substantive participant responses.

[[extract:filtered_text]]

Remove a Node

Delete the node definition and its template. Update dependent nodes:

# Remove checkquotes node
nodes:
  # ... other nodes
  # - name: checkquotes      # REMOVED
  #   type: VerifyQuotes
  #   inputs:
  #     - codes

Delete the template section:

# ---#checkquotes           # DELETE THIS SECTION

Change Node Parameters

Adjust processing behavior:

- name: chunks
  type: Split
  chunk_size: 15000          # Smaller chunks
  overlap: 500               # Add overlap to preserve context

- name: chunk_codes_and_themes
  type: Map
  max_tokens: 8000           # Reduce max tokens
  temperature: 0.3           # Lower temperature = more consistent
  inputs:
    - chunks

Testing Your Changes

Test on Small Data

# Test with single file first
uv run soak my_analysis.soak data/test_interview.txt -f json | jq '.codes'

Check Intermediate Outputs

# Run pipeline to inspect each stage (dump created automatically)
uv run soak my_analysis.soak data/test.txt -o test

# Review specific node output
cat test_dump/02_Map_chunk_codes_and_themes/0000_*_response.json | jq

Validate Templates

Templates use Jinja2. Test syntax:

from jinja2 import Template

template = Template("")
print(template.render(research_question="What is recovery?"))

Example: Complete Custom Pipeline

Here’s a focused pipeline for analyzing treatment experiences:

name: treatment_analysis
default_context:
  persona: Medical anthropologist
  condition: chronic fatigue syndrome

nodes:
  - name: chunks
    type: Split
    chunk_size: 25000

  - name: treatment_codes
    type: Map
    max_tokens: 12000
    inputs:
      - chunks

  - name: all_treatments
    type: Reduce
    inputs:
      - treatment_codes

  - name: consolidated_treatments
    type: Transform
    inputs:
      - all_treatments

  - name: narrative
    type: Transform
    inputs:
      - consolidated_treatments

---#treatment_codes

You are a  analyzing patient experiences with .

Focus exclusively on TREATMENT experiences. Code for:
- Treatments tried (name them specifically)
- Effectiveness (helped/hurt/no effect)
- Side effects
- Reasons for starting/stopping
- Provider relationships



[[code*:codes]]

---#all_treatments



---#consolidated_treatments

Merge duplicate treatments from different transcripts.



[[codenotes]]

[[code*:codes]]

---#narrative

Summarize treatment patterns:



[[report]]

Run it:

uv run soak treatment_analysis.soak data/*.txt -o treatment_results

Tips

Keep original pipeline:

cp my_analysis.soak my_analysis_backup.soak

Version your pipelines:

name: education_analysis_v2
# Note: Changed theme consolidation prompt

Document your changes:

# Added filtering node to remove interviewer speech
# Reduced chunk_size from 30k to 20k for finer granularity
# Modified code criteria to focus on behaviors not feelings

Start small:

Change one thing at a time and test. Don’t modify multiple nodes simultaneously.

Use verbose mode:

uv run soak my_analysis.soak data/test.txt -o test -v

Shows what’s happening at each stage.

Next Steps

Working with Results - Analyzing output data
Node Types - Understanding different nodes
Template System - Advanced template features
Node Reference - All node parameters