Supported Document Formats

soak extracts text from common document formats and converts it to Markdown for consistent processing. Spreadsheets (CSV and Excel) are read as structured rows instead. This page explains which formats are supported and how extraction works.

Supported formats

Format	Extension	Extraction method
Word	`.docx`	pandoc
RTF	`.rtf`	pandoc
Plain text	`.txt`	pandoc
Markdown	`.md`, `.markdown`	pandoc
PDF	`.pdf`	pdfplumber
CSV	`.csv`	pandas (structured)
Excel	`.xlsx`, `.xls`	pandas (structured)
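
The table can be read as a lookup from file extension to extraction backend. The following is a minimal sketch of that idea; the `EXTRACTORS` mapping and the `extractor_for` helper are hypothetical names used for illustration and are not part of soak's documented API.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the table above; not soak's actual internals.
EXTRACTORS = {
    ".docx": "pandoc",
    ".rtf": "pandoc",
    ".txt": "pandoc",
    ".md": "pandoc",
    ".markdown": "pandoc",
    ".pdf": "pdfplumber",
    ".csv": "pandas",
    ".xlsx": "pandas",
    ".xls": "pandas",
}


def extractor_for(path):
    """Return the backend name for a file, based on its lower-cased extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix or path}") from None
```

Lower-casing the suffix means `INTERVIEW.DOCX` and `interview.docx` resolve to the same backend; whether soak itself matches extensions case-insensitively is an assumption here.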

Requirements

pandoc must be installed for document extraction (Word, RTF, text, Markdown):

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt install pandoc

# Windows
choco install pandoc

PDF extraction uses pdfplumber (installed automatically with soak).
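
Before running a batch, it can be useful to confirm that pandoc is on the PATH. A small sketch using only the Python standard library; the `pandoc_available` helper is a hypothetical name, not a soak function.

```python
import shutil


def pandoc_available(executable="pandoc"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if pandoc_available():
        print("pandoc found")
    else:
        print("pandoc not found; install it before extracting Word, RTF, text, or Markdown files")
```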

How extraction works

Documents → Markdown

Word, RTF, text, and Markdown files are converted to GitHub-Flavoured Markdown (GFM) using pandoc. This preserves:

Headings and structure
Lists (bulleted and numbered)
Bold, italic, and other emphasis
Tables
Links and footnotes

Example conversion from a Word document:

Original DOCX:
  [Heading 1] Interview with Participant 5
  [Bold] Date: [/Bold] 15 March 2024
  [Bullet] First point
  [Bullet] Second point

Extracted Markdown:
  # Interview with Participant 5

  **Date:** 15 March 2024

  - First point
  - Second point
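
A conversion like the one above can be reproduced by calling pandoc directly. This is a sketch under assumptions: the helper names are hypothetical, and the exact options soak passes to pandoc are not documented here. `--from`, `--to gfm`, and `--wrap=none` are standard pandoc options.

```python
import subprocess


def build_pandoc_command(path, input_format):
    """Build a pandoc command line that writes GFM to standard output."""
    return ["pandoc", "--from", input_format, "--to", "gfm", "--wrap=none", str(path)]


def convert_to_gfm(path, input_format):
    """Run pandoc and return the converted Markdown as a string."""
    result = subprocess.run(
        build_pandoc_command(path, input_format),
        capture_output=True,
        encoding="utf-8",  # pandoc writes UTF-8
        check=True,        # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout
```

For example, `convert_to_gfm("interview.docx", "docx")` would return the Markdown text for a Word file, provided pandoc is installed.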

PDF extraction

PDFs are processed with pdfplumber to extract embedded text only:

No OCR (scanned, image-only PDFs yield no text)
No layout reconstruction
Paragraph breaks preserved where detectable
Page breaks become double newlines

For scanned PDFs, convert to searchable PDF first using OCR software.
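
The behaviour described above can be approximated with pdfplumber's `open()` and `Page.extract_text()`. This is an illustrative sketch, not soak's implementation: the function names are hypothetical, and skipping pages that contain no text is an assumption.

```python
def join_pages(page_texts):
    """Join per-page text with a blank line, skipping pages with no text."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path):
    """Extract embedded text from a PDF; returns "" for image-only PDFs."""
    import pdfplumber  # imported here so join_pages can be used without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None or "" for pages without a text layer
        return join_pages(page.extract_text() for page in pdf.pages)
```

An empty return value from `extract_pdf_text` is a reasonable signal that the PDF needs OCR first.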

Whitespace normalisation

All extracted text is normalised:

Windows line endings (\r\n) converted to Unix (\n)
Runs of multiple spaces collapsed to a single space
Three or more consecutive newlines collapsed to two
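
The three rules above can be sketched as a short function. The name `normalise_whitespace` is hypothetical, and treating "spaces" as the space character only (not tabs) is an assumption.

```python
import re


def normalise_whitespace(text):
    """Apply the three normalisation rules in order."""
    text = text.replace("\r\n", "\n")       # Windows line endings to Unix
    text = re.sub(r" {2,}", " ", text)      # runs of spaces to one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # three or more newlines to two
    return text
```

Line endings are converted first so that the newline rule sees a consistent `\n` separator.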

Usage examples

Single file

soak run thematic interview.docx -o analysis

Multiple files

soak run thematic interviews/*.rtf -o analysis

Mixed formats

soak run thematic data/interview1.docx data/interview2.pdf notes.txt -o analysis

ZIP archives

soak can extract files from ZIP archives:

soak run thematic interviews.zip -o analysis

Files inside the ZIP are extracted to a temporary directory, processed, then cleaned up.
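
The extract, process, clean up cycle can be sketched with the standard library. The `process_zip` function and its `handler` parameter are hypothetical names; soak's own ZIP handling may differ in detail.

```python
import tempfile
import zipfile
from pathlib import Path


def process_zip(zip_path, handler):
    """Extract a ZIP to a temporary directory and call handler on each file.

    Returns a dict mapping each member's relative path to handler's result.
    The temporary directory is removed when the with-block exits.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        for file in sorted(Path(tmp).rglob("*")):
            if file.is_file():
                results[file.relative_to(tmp).as_posix()] = handler(file)
    return results
```

Because the handler runs inside the `with` block, any result that needs to outlive the temporary directory must be returned as data rather than as a file path.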

Troubleshooting

“pandoc not found”

Install pandoc using your system package manager (see Requirements above).

RTF files have no paragraph breaks

Old RTF files may use non-standard paragraph markers. If paragraphs run together, open the file in Word or LibreOffice and re-save it as .docx.

PDF extraction returns empty text

The PDF likely contains scanned images rather than embedded text. Use OCR software to create a searchable PDF first.

Encoding errors

soak expects UTF-8 encoded files. For files with other encodings, convert first:

iconv -f ISO-8859-1 -t UTF-8 input.txt > input_utf8.txt
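
Where iconv is not available (for example on Windows), the same conversion can be done in Python. A minimal sketch; `reencode_to_utf8` is a hypothetical helper, and the source encoding must be known in advance.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Decode src using source_encoding and write the same text to dst as UTF-8."""
    raw = Path(src).read_bytes()
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(raw.decode(source_encoding).encode("utf-8"))
```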

Programmatic access

You can use the extraction functions directly in Python:

from soak.document_utils import extract_text, get_supported_extensions

# Check supported formats
print(get_supported_extensions())
# ['.docx', '.rtf', '.txt', '.md', '.markdown', '.pdf', '.csv', '.xlsx', '.xls']

# Extract text from a document
text = extract_text("interview.docx")
print(text)  # Markdown string

# Spreadsheets return structured data
rows = extract_text("survey.csv")
print(rows)  # List of dicts, one per row
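
Because `extract_text` returns a Markdown string for documents and a list of dicts for spreadsheets, calling code may need to branch on the result type. The helper below is a hypothetical sketch that assumes exactly those two return shapes, as described in the example above.

```python
def summarise_extraction(result):
    """Return a one-line description of an extract_text() result."""
    if isinstance(result, str):
        # Documents: a Markdown string
        return f"document: {len(result.split())} words"
    if isinstance(result, list):
        # Spreadsheets: a list of dicts, one per row
        columns = sorted(result[0]) if result else []
        return f"spreadsheet: {len(result)} rows, columns {columns}"
    raise TypeError(f"Unexpected result type: {type(result).__name__}")
```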

Format-specific notes

Word (.docx)

Comments and tracked changes are ignored
Document metadata is stripped
Headers/footers are included in extraction
Embedded images are ignored (text only)

RTF

Structural elements preserved where possible
Complex formatting may be simplified
Older RTF variants may have reduced fidelity

Plain text (.txt)

Treated as Markdown for normalisation
No formatting interpretation
UTF-8 encoding expected

Markdown (.md, .markdown)

Passed through with whitespace normalisation only
Frontmatter (YAML headers) preserved
Valid GFM output guaranteed

PDF

Text extraction only, no OCR
Reading order follows PDF structure
Multi-column layouts may interleave incorrectly
Tables extracted as plain text (no structure)