Pre-extraction Workflow
The zspe (zero-shot pre-extract) pipeline filters text before analysis. Use this when documents contain irrelevant content or when you want to focus on specific topics.
When to Use Pre-extraction
Use zspe when:
- Documents are long with mixed content (e.g., full interview transcripts with off-topic discussion)
- You want to analyze specific topics mentioned throughout documents
- Source material includes interviewer speech you want to exclude
- You need to reduce processing time/cost by filtering early
Use standard zs when:
- All content is relevant
- You want comprehensive analysis
- Documents are already focused on your topic
Pipeline Overview
The zspe pipeline adds a pre-extraction step before coding:
1. Split documents into chunks
2. Extract relevant excerpts from each chunk ← NEW
3. Generate codes and themes from excerpts
4. Reduce and consolidate (as in zs)
5. Final themes and narrative
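The flow above can be pictured with a toy sketch. This is not soak's implementation: split_into_chunks and extract_relevant are hypothetical names, chunk size is measured in characters only for the purposes of the sketch, and a simple keyword match stands in for the LLM extraction step. It only illustrates that the coding stage receives filtered excerpts rather than full chunks.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def extract_relevant(chunk: str, keywords: list[str]) -> str:
    """Toy stand-in for the LLM step: keep lines mentioning any keyword, verbatim."""
    kept = [
        line for line in chunk.splitlines()
        if any(k.lower() in line.lower() for k in keywords)
    ]
    return "\n".join(kept)


# Hypothetical three-line transcript; only two lines are on-topic.
transcript = (
    "Patient: I started walking every morning.\n"
    "Patient: The weather has been awful lately.\n"
    "Patient: Physio exercises helped my knee.\n"
)

chunks = split_into_chunks(transcript, chunk_size=1000)
excerpts = [extract_relevant(c, ["walking", "exercise"]) for c in chunks]

# Only the excerpts (not the full chunks) would be passed on to coding.
print(excerpts[0])
```

In the real pipeline the relevance judgement is made by the LLM from excerpt_topics, so it can pick up passages that share no keywords with the topic description.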
Running the Pipeline
uv run soak zspe data/interviews/*.txt \
--output results \
-c excerpt_topics="Exercise, physical rehabilitation, and recovery through movement"
The excerpt_topics context variable tells the LLM what to extract.
Pipeline Stages
Stage 1: chunks (Split)
Same as in the zs pipeline: splits documents into manageable chunks.
- name: chunks
  type: Split
  chunk_size: 30000
Stage 2: extract_relevant_excepts (Map)
This node filters each chunk to relevant content.
- name: extract_relevant_excepts
  type: Map
  inputs:
    - chunks
Template:
You are pre-reading qualitative transcripts to identify the most relevant
excerpts for the research question.
You will extract text from the interview according to the following criteria:
- only patients' speech (not the interviewer, unless necessary for context)
- EXACTLY the text from the transcript, VERBATIM, with no amendments
- only the sections that are relevant to the research question
<transcribed text>
</transcribed text>
Extract elements of the text relevant to the research question:
Copy text related to this question/topic verbatim.
[[extract:relevant_content]]
Key points:
- excerpt_topics comes from the CLI -c option
- Template emphasizes VERBATIM extraction (no paraphrasing)
- Filters to participant speech only
- Uses [[extract:relevant_content]] for free-form text output
Input: Each chunk
Output: Filtered text containing only relevant excerpts
Stage 3+: chunk_codes_and_themes (Map)
Same as in the zs pipeline, but operates on the extracted content instead of full chunks.
- name: chunk_codes_and_themes
  type: Map
  max_tokens: 16000
  inputs:
    - extract_relevant_excepts  # Note: uses filtered content
The rest of the pipeline (all_codes, all_themes, codes, themes, narrative) is identical to zs.
Comparison: zs vs zspe
Standard zs
Document → Chunks → Code all chunks → Consolidate
Codes everything, including off-topic content.
Pre-extraction zspe
Document → Chunks → Extract relevant → Code excerpts → Consolidate
Codes only content matching your topic.
Example Workflow
Step 1: Identify Your Topic
What aspect of the data matters for your research question?
Examples:
- “Recovery experiences and symptom improvement”
- “Work and employment challenges”
- “Social relationships and support systems”
- “Treatment experiences and medical interactions”
Step 2: Run with Topic
uv run soak zspe data/*.txt \
--output results \
-c excerpt_topics="Recovery experiences and symptom improvement"
Step 3: Review Extractions
Check results_dump/02_Map_extract_relevant_excepts/ to see what was extracted:
# View an extraction from the dump directory (created automatically)
cat results_dump/02_Map_extract_relevant_excepts/0000_*_response.json | jq '.relevant_content'
If extractions are too narrow/broad, adjust excerpt_topics and re-run.
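To review every chunk at once rather than one file at a time, a short script can summarize the dump directory. This is a minimal sketch, assuming each *_response.json file holds a JSON object with a top-level relevant_content string (as the jq command above implies); summarize_extractions is a hypothetical helper, not part of soak.

```python
import json
from pathlib import Path


def summarize_extractions(dump_dir: str) -> list[tuple[str, int]]:
    """Return (file name, word count of relevant_content) for each response file."""
    root = Path(dump_dir)
    if not root.is_dir():
        return []
    rows = []
    for path in sorted(root.glob("*_response.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: a top-level "relevant_content" string; treat missing as empty.
        text = data.get("relevant_content") or ""
        rows.append((path.name, len(text.split())))
    return rows


if __name__ == "__main__":
    dump = "results_dump/02_Map_extract_relevant_excepts"
    for name, words in summarize_extractions(dump):
        flag = "  <- empty" if words == 0 else ""
        print(f"{words:6d} words  {name}{flag}")
```

Chunks that come back empty or unusually short are the first ones worth reading against the original transcript.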
Step 4: Analyze Results
View codes and themes as usual:
open results.html
Codes will focus on your specified topics.
Customization
Multiple Topics
uv run soak zspe data/*.txt \
--output results \
-c excerpt_topics="1) Physical symptoms and energy levels, 2) Social isolation, 3) Medical treatment"
Adjust Extraction Instructions
Copy and modify the template:
uv run soak show pipeline zspe > my_zspe.soak
Edit the extract_relevant_excepts section:
---#extract_relevant_excepts
Extract ONLY direct quotes where participants discuss:
Rules:
- Participant speech only
- Keep full sentences for context
- Include emotional language
- Preserve hesitations and emphasis
[[extract:relevant_content]]
Include Interviewer Context
Modify the template to keep interviewer questions:
Extract text according to criteria:
- Participant responses about:
- Include interviewer questions immediately before responses
- Keep verbatim
[[extract:relevant_content]]
Tips
Check extraction quality:
Always verify extractions match expectations by reviewing the dump directory:
uv run soak zspe data/*.txt -o results
# Then review results_dump/02_Map_extract_relevant_excepts/
Too little extracted:
- Broaden excerpt_topics description
- Check if topic actually appears in data
- Review original documents
Too much extracted:
- Narrow excerpt_topics to specific aspects
- Add exclusion criteria to template
- Use more specific language
Extraction misses context:
Increase chunk_size so related content stays together:
- name: chunks
  type: Split
  chunk_size: 50000  # Larger chunks = better context
Paraphrasing instead of verbatim:
Emphasize in template:
CRITICAL: Copy text EXACTLY as written. Do not paraphrase, summarize, or edit.
Use "..." to mark skipped sections.
When Pre-extraction Helps
Long Mixed Documents
Interview transcripts often wander off-topic. Pre-extraction keeps analysis focused:
Full transcript: 50,000 words
After extraction: 8,000 words (relevant content only)
Result: Faster, cheaper, more focused codes
Multi-topic Datasets
Run multiple analyses on the same data with different topics:
# Analysis 1: Physical health
uv run soak zspe data/*.txt \
-o health_analysis \
-c excerpt_topics="Physical symptoms and bodily experiences"
# Analysis 2: Social impact
uv run soak zspe data/*.txt \
-o social_analysis \
-c excerpt_topics="Relationships and social interactions"
# Compare results
uv run soak compare social_analysis.json health_analysis.json
Filtering Irrelevant Speakers
Transcripts with multiple speakers:
Extract criteria:
- Patient speech only
- Exclude clinician, researcher, family members
- Include only when discussing:
Common Issues
Empty extractions:
Check if excerpt_topics matches actual content:
# Verify topic exists in data
grep -i "recovery" data/*.txt
Extraction too aggressive:
The LLM might filter out important context. Review the raw response and adjust the template or excerpt_topics:
cat results_dump/02_Map_extract_relevant_excepts/0000_*_response.json
Quotes don’t verify:
VerifyQuotes may fail if the extraction step altered the text. Make sure the template emphasizes verbatim copying.
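A rough local check can show whether an extraction drifted from the source before quote verification is run. The sketch below is an assumption-laden illustration, not how VerifyQuotes works internally: find_unverified is a hypothetical helper that splits an extraction on newlines and "..." markers, collapses whitespace, and reports any segment that is not an exact substring of the source.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip()


def find_unverified(extracted: str, source: str) -> list[str]:
    """Return segments of the extraction that do not occur verbatim in the source.

    Segments are split on newlines and on "..." markers for skipped sections.
    """
    haystack = _normalize(source)
    segments = re.split(r"\n+|\.{3,}|…", extracted)
    missing = []
    for segment in segments:
        needle = _normalize(segment)
        if needle and needle not in haystack:
            missing.append(needle)
    return missing


# Hypothetical transcript fragment and two candidate extractions.
source = "I: How is the exercise going?\nP: I walk most days now. It was hard at first."
good = "I walk most days now. ... It was hard at first."
bad = "I walk on most days now."

print(find_unverified(good, source))  # []
print(find_unverified(bad, source))   # ['I walk on most days now.']
```

Exact substring matching is deliberately strict: even a single inserted word is reported, which is the kind of paraphrasing the template is meant to prevent.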
Next Steps
- Thematic Analysis - Understanding the full pipeline
- Customizing Your Analysis - Adapting prompts
- Node Types - Understanding Map nodes