Pre-extraction Workflow

The zspe (zero-shot pre-extract) pipeline filters text before analysis. Use this when documents contain irrelevant content or when you want to focus on specific topics.

When to Use Pre-extraction

Use zspe when:

Documents are long with mixed content (e.g., full interview transcripts with off-topic discussion)
You want to analyze specific topics mentioned throughout documents
Source material includes interviewer speech you want to exclude
You need to reduce processing time/cost by filtering early

Use standard zs when:

All content is relevant
You want comprehensive analysis
Documents are already focused on your topic

Pipeline Overview

The zspe pipeline adds a pre-extraction step before coding:

Split documents into chunks
Extract relevant excerpts from each chunk  ← NEW
Generate codes and themes from excerpts
Reduce and consolidate (as in zs)
Final themes and narrative

Running the Pipeline

uv run soak zspe data/interviews/*.txt \
  --output results \
  -c excerpt_topics="Exercise, physical rehabilitation, and recovery through movement"

The excerpt_topics context variable tells the LLM what to extract.

Pipeline Stages

Stage 1: chunks (Split)

Same as in the zs pipeline: splits documents into manageable chunks.

- name: chunks
  type: Split
  chunk_size: 30000

Stage 2: extract_relevant_excepts (Map)

This node filters each chunk to relevant content.

- name: extract_relevant_excepts
  type: Map
  inputs:
    - chunks

Template:

You are pre-reading qualitative transcripts to identify the most relevant
excerpts for the research question.

You will extract text from the interview according to the following criteria:
- only patients' speech (not the interviewer, unless necessary for context)
- EXACTLY the text from the transcript, VERBATIM, with no amendments
- only the sections that are relevant to the research question

<transcribed text>
{{input}}
</transcribed text>

Extract elements of the text relevant to the research question:

{{excerpt_topics}}

Copy text related to this question/topic verbatim.

[[extract:relevant_content]]

Key points:

excerpt_topics comes from the CLI -c option
Template emphasizes VERBATIM extraction (no paraphrasing)
Filters to participant speech only
Uses [[extract:relevant_content]] for free-form text output

Input: Each chunk
Output: Filtered text containing only relevant excerpts

Stage 3+: chunk_codes_and_themes (Map)

Same as in the zs pipeline, but it operates on the extracted content instead of the full chunks.

- name: chunk_codes_and_themes
  type: Map
  max_tokens: 16000
  inputs:
    - extract_relevant_excepts  # Note: uses filtered content

The rest of the pipeline (all_codes, all_themes, codes, themes, narrative) is identical to zs.

Comparison: zs vs zspe

Standard zs

Document → Chunks → Code all chunks → Consolidate

Codes everything, including off-topic content.

Pre-extraction zspe

Document → Chunks → Extract relevant → Code excerpts → Consolidate

Codes only content matching your topic.

Example Workflow

Step 1: Identify Your Topic

What aspect of the data matters for your research question?

Examples:

“Recovery experiences and symptom improvement”
“Work and employment challenges”
“Social relationships and support systems”
“Treatment experiences and medical interactions”

Step 2: Run with Topic

uv run soak zspe data/*.txt \
  --output results \
  -c excerpt_topics="Recovery experiences and symptom improvement"

Step 3: Review Extractions

Check results_dump/02_Map_extract_relevant_excepts/ to see what was extracted:

# View an extraction from the dump directory (created automatically)
cat results_dump/02_Map_extract_relevant_excepts/0000_*_response.json | jq '.relevant_content'

If extractions are too narrow/broad, adjust excerpt_topics and re-run.
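
With many chunks, opening responses one at a time gets slow. The sketch below lists how much text each chunk yielded so that empty or unusually short extractions stand out. It assumes each *_response.json file is a JSON object with a top-level relevant_content string, as the jq command above suggests; the summarize_extractions helper is hypothetical and not part of soak.

```python
import json
from pathlib import Path


def summarize_extractions(dump_dir):
    """Return (file name, character count) for each extraction response."""
    rows = []
    for path in sorted(Path(dump_dir).glob("*_response.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: the extracted text is a top-level "relevant_content" string.
        text = data.get("relevant_content") or ""
        rows.append((path.name, len(str(text).strip())))
    return rows


if __name__ == "__main__":
    dump_dir = "results_dump/02_Map_extract_relevant_excepts"
    for name, n_chars in summarize_extractions(dump_dir):
        flag = "  <- empty" if n_chars == 0 else ""
        print(f"{n_chars:>8}  {name}{flag}")
```

Chunks with a count of zero are worth comparing against the original text before you change excerpt_topics.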

Step 4: Analyze Results

View codes and themes as usual:

open results.html

Codes will focus on your specified topics.

Customization

Multiple Topics

uv run soak zspe data/*.txt \
  --output results \
  -c excerpt_topics="1) Physical symptoms and energy levels, 2) Social isolation, 3) Medical treatment"

Adjust Extraction Instructions

Copy and modify the template:

uv run soak show pipeline zspe > my_zspe.soak

Edit the extract_relevant_excepts section:

---#extract_relevant_excepts

Extract ONLY direct quotes where participants discuss:

{{excerpt_topics}}

Rules:
- Participant speech only
- Keep full sentences for context
- Include emotional language
- Preserve hesitations and emphasis

[[extract:relevant_content]]

Include Interviewer Context

Modify the template to keep interviewer questions:

Extract text according to criteria:
- Participant responses about: {{excerpt_topics}}
- Include interviewer questions immediately before responses
- Keep verbatim

[[extract:relevant_content]]

Tips

Check extraction quality:

Always verify extractions match expectations by reviewing the dump directory:

uv run soak zspe data/*.txt -o results
# Then review results_dump/02_Map_extract_relevant_excepts/

Too little extracted:

Broaden excerpt_topics description
Check if topic actually appears in data
Review original documents

Too much extracted:

Narrow excerpt_topics to specific aspects
Add exclusion criteria to template
Use more specific language

Extraction misses context:

Increase chunk_size so related content stays together:

- name: chunks
  type: Split
  chunk_size: 50000  # Larger chunks = better context

Paraphrasing instead of verbatim:

Emphasize this in the template:

CRITICAL: Copy text EXACTLY as written. Do not paraphrase, summarize, or edit.
Use "..." to mark skipped sections.
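
To find paraphrased passages without reading every extraction, you can check whether each extracted passage occurs word-for-word in the chunk it came from. This is a minimal sketch, not part of soak: non_verbatim_passages is a hypothetical helper, and it assumes you have loaded the chunk text and the extracted text as strings. It splits the extraction on blank lines and on "..." markers, and collapses whitespace so that line wrapping does not cause false alarms.

```python
import re


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def non_verbatim_passages(source, extracted):
    """Return passages of `extracted` that do not occur word-for-word in `source`."""
    haystack = normalize(source)
    # Split on blank lines and on "..." or "…" markers for skipped sections.
    passages = re.split(r"\n\s*\n|\.{3,}|…", extracted)
    missing = []
    for passage in passages:
        needle = normalize(passage)
        if needle and needle not in haystack:
            missing.append(needle)
    return missing
```

An empty list means every passage was found in the source. Any passage returned is a candidate for paraphrasing, although a mismatch can also come from changed punctuation or quotation marks.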

When Pre-extraction Helps

Long Mixed Documents

Interview transcripts often wander off-topic. Pre-extraction keeps analysis focused:

Full transcript: 50,000 words
After extraction: 8,000 words (relevant content only)
Result: Faster, cheaper, more focused codes
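
As a worked version of the illustrative figures above, the share of text removed before coding can be computed directly. The reduction helper is hypothetical. The extraction step still reads every chunk, so this figure describes only the input to the later coding stages.

```python
def reduction(words_before, words_after):
    """Fraction of the text removed by pre-extraction."""
    return (words_before - words_after) / words_before


# Illustrative figures from the example above.
print(f"{reduction(50_000, 8_000):.0%} of the text is filtered out")  # 84% of the text is filtered out
```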

Multi-topic Datasets

Run multiple analyses on the same data with different topics:

# Analysis 1: Physical health
uv run soak zspe data/*.txt \
  -o health_analysis \
  -c excerpt_topics="Physical symptoms and bodily experiences"

# Analysis 2: Social impact
uv run soak zspe data/*.txt \
  -o social_analysis \
  -c excerpt_topics="Relationships and social interactions"

# Compare results
uv run soak compare social_analysis.json health_analysis.json
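
With more than a couple of topics, the per-topic commands can be generated from a mapping of output names to topics. The sketch below only builds and prints the commands, using the same -o and -c excerpt_topics options as the commands above; build_commands is a hypothetical helper and the comparison step is left out.

```python
import glob
import shlex


def build_commands(pattern, analyses):
    """Build one `soak zspe` command per (output name, topic) pair."""
    files = sorted(glob.glob(pattern))
    return [
        ["uv", "run", "soak", "zspe", *files, "-o", output, "-c", f"excerpt_topics={topic}"]
        for output, topic in analyses.items()
    ]


if __name__ == "__main__":
    analyses = {
        "health_analysis": "Physical symptoms and bodily experiences",
        "social_analysis": "Relationships and social interactions",
    }
    # Print each command for review before running anything.
    for cmd in build_commands("data/*.txt", analyses):
        print(shlex.join(cmd))
```

Each command is a list of arguments, so it can be passed to subprocess.run(cmd, check=True) once the printed output looks right.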

Filtering Irrelevant Speakers

Transcripts with multiple speakers:

Extract criteria:
- Patient speech only
- Exclude clinician, researcher, family members
- Include only when discussing: {{excerpt_topics}}

Common Issues

Empty extractions:

Check if excerpt_topics matches actual content:

# Verify topic exists in data
grep -i "recovery" data/*.txt
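
A single grep covers one word. To see which files mention any of several terms related to your topic, a per-file count can help. This is a rough sketch: keyword_hits is a hypothetical helper and the keyword list is only an example. Substring matching is crude, so treat the counts as a hint about where the topic appears, not as a measure of relevance.

```python
import re
from pathlib import Path


def keyword_hits(paths, keywords):
    """Count case-insensitive keyword matches in each file."""
    hits = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        hits[str(path)] = {
            kw: len(re.findall(re.escape(kw), text, flags=re.IGNORECASE))
            for kw in keywords
        }
    return hits


if __name__ == "__main__":
    # Example stems for a topic such as "Recovery experiences and symptom improvement".
    keywords = ["recover", "improv", "better"]
    for name, counts in keyword_hits(sorted(Path("data").glob("*.txt")), keywords).items():
        print(name, counts)
```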

Extraction too aggressive:

The LLM might filter out important context. Review the raw responses and adjust the template or excerpt_topics:

cat results_dump/02_Map_extract_relevant_excepts/0000_*_response.json

Quotes don’t verify:

VerifyQuotes may fail if the extraction step altered the text. Make sure the template emphasizes verbatim copying.

Next Steps

Thematic Analysis - Understanding the full pipeline
Customizing Your Analysis - Adapting prompts
Node Types - Understanding Map nodes