Supported Document Formats

soak extracts text from common document formats and converts everything to Markdown for consistent processing. This page explains which formats are supported and how extraction works.

Supported formats

Format      Extension         Extraction method
Word        .docx             pandoc
RTF         .rtf              pandoc
Plain text  .txt              pandoc
Markdown    .md, .markdown    pandoc
PDF         .pdf              pdfplumber
CSV         .csv              pandas (structured)
Excel       .xlsx, .xls       pandas (structured)
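
The table above can be read as a lookup from file extension to extraction backend. The following is a minimal sketch of that lookup, not soak's internal code: the `EXTRACTORS` mapping and the `extractor_for` helper are hypothetical names, and the case-insensitive matching is an assumption.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the table above; not soak's internal code.
EXTRACTORS = {
    ".docx": "pandoc",
    ".rtf": "pandoc",
    ".txt": "pandoc",
    ".md": "pandoc",
    ".markdown": "pandoc",
    ".pdf": "pdfplumber",
    ".csv": "pandas",
    ".xlsx": "pandas",
    ".xls": "pandas",
}


def extractor_for(path):
    """Return the backend name for a file, based on its extension."""
    # Assumption: extensions are matched case-insensitively.
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported format: {suffix or path}")
    return EXTRACTORS[suffix]
```

Raising on an unknown extension, rather than guessing a backend, keeps unsupported files from being silently read as text.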

Requirements

pandoc must be installed for document extraction (Word, RTF, text, Markdown):

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt install pandoc

# Windows
choco install pandoc

PDF extraction uses pdfplumber (installed automatically with soak).

How extraction works

Documents → Markdown

Word, RTF, text, and Markdown files are converted to GitHub-Flavoured Markdown (GFM) using pandoc. This preserves:

  • Headings and structure
  • Lists (bulleted and numbered)
  • Bold, italic, and other emphasis
  • Tables
  • Links and footnotes

Example conversion from a Word document:

Original DOCX:
  [Heading 1] Interview with Participant 5
  [Bold] Date: [/Bold] 15 March 2024
  [Bullet] First point
  [Bullet] Second point

Extracted Markdown:
  # Interview with Participant 5

  **Date:** 15 March 2024

  - First point
  - Second point
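
A conversion like the one above can be reproduced by calling pandoc directly. This is a sketch under assumptions, not soak's actual invocation: `build_pandoc_command` and `convert_to_gfm` are hypothetical helpers, and soak may pass additional options. The `--to gfm` option is pandoc's standard way of selecting GitHub-Flavoured Markdown output.

```python
import subprocess


def build_pandoc_command(path):
    """Build a pandoc invocation that writes GFM to standard output."""
    # pandoc infers the input format from the file extension where it can;
    # "--to gfm" selects GitHub-Flavoured Markdown as the output format.
    return ["pandoc", str(path), "--to", "gfm"]


def convert_to_gfm(path):
    """Run pandoc and return the converted Markdown (needs pandoc on PATH)."""
    result = subprocess.run(
        build_pandoc_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```

Calling `convert_to_gfm("interview.docx")` would return the Markdown as a string, provided pandoc is installed.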

PDF extraction

PDFs are processed with pdfplumber to extract embedded text only:

  • No OCR (scanned documents won’t work)
  • No layout reconstruction
  • Paragraph breaks preserved where detectable
  • Page breaks become double newlines

For scanned PDFs, convert to searchable PDF first using OCR software.
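
The behaviour described above can be approximated with pdfplumber's public API (`pdfplumber.open`, `pdf.pages`, `page.extract_text`). This is a sketch, not soak's implementation: the helper names `join_pages` and `extract_pdf_text` are hypothetical, and skipping empty pages is an assumption.

```python
def join_pages(page_texts):
    """Join per-page text with a blank line between pages."""
    # Assumption: pages with no embedded text are skipped entirely.
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path):
    """Extract embedded text from each page with pdfplumber (no OCR)."""
    import pdfplumber  # imported lazily so join_pages works without it

    with pdfplumber.open(path) as pdf:
        # extract_text() yields nothing useful for image-only pages
        return join_pages(page.extract_text() for page in pdf.pages)
```

An empty string from `extract_pdf_text` is a strong hint that the PDF holds scanned images rather than embedded text.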

Whitespace normalisation

All extracted text is normalised:

  • Windows line endings (\r\n) converted to Unix (\n)
  • Multiple consecutive spaces collapsed to a single space
  • Three or more consecutive newlines collapsed to two
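
The three rules above can be expressed as a short function. This is an illustrative sketch rather than soak's own code: the name `normalise_whitespace` is hypothetical, and the order of the steps and the treatment of tabs (left untouched here) are assumptions.

```python
import re


def normalise_whitespace(text):
    """Apply the three normalisation rules listed above."""
    # 1. Windows line endings become Unix line endings.
    text = text.replace("\r\n", "\n")
    # 2. Runs of two or more spaces collapse to a single space.
    text = re.sub(r" {2,}", " ", text)
    # 3. Runs of three or more newlines collapse to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

For example, `normalise_whitespace("a  b\r\n\r\n\r\n\r\nc")` returns `"a b\n\nc"`.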

Usage examples

Single file

soak run thematic interview.docx -o analysis

Multiple files

soak run thematic interviews/*.rtf -o analysis

Mixed formats

soak run thematic data/interview1.docx data/interview2.pdf notes.txt -o analysis

ZIP archives

soak can extract files from ZIP archives:

soak run thematic interviews.zip -o analysis

Files inside the ZIP are extracted to a temporary directory, processed, then cleaned up.
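
The extract, process, clean-up cycle can be sketched with the Python standard library. This is not soak's implementation: `process_zip` and its `process_file` callback are hypothetical names used only for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def process_zip(zip_path, process_file):
    """Extract a ZIP to a temporary directory and process each file in it."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        for path in sorted(Path(tmp).rglob("*")):
            if path.is_file():
                # Key each result by its path relative to the archive root.
                results[path.relative_to(tmp).as_posix()] = process_file(path)
    # Leaving the "with" block removes the temporary directory and its files.
    return results
```

Because the temporary directory is deleted on exit, `process_file` has to return everything it needs from a file rather than keep the path for later.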

Troubleshooting

“pandoc not found”

Install pandoc using your system package manager (see Requirements above).
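
To confirm whether pandoc is visible from Python before running soak, a check along these lines may help. The helper name `pandoc_available` is hypothetical; `shutil.which` is the standard-library lookup for executables on PATH.

```python
import shutil


def pandoc_available():
    """Return True if a pandoc executable is found on PATH."""
    return shutil.which("pandoc") is not None


if not pandoc_available():
    print("pandoc was not found on PATH; install it before extracting documents.")
```

If pandoc is installed but this still reports it missing, the shell or environment running Python probably has a different PATH from the one used for installation.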

RTF files have no paragraph breaks

Old RTF files may use non-standard paragraph markers. If paragraphs run together, try opening the file in Word or LibreOffice and re-saving it as .docx.

PDF extraction returns empty text

The PDF likely contains scanned images rather than embedded text. Use OCR software to create a searchable PDF first.

Encoding errors

soak expects UTF-8 encoded files. For files with other encodings, convert first:

iconv -f ISO-8859-1 -t UTF-8 input.txt > input_utf8.txt
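
Where iconv is not available (for example on Windows), the same conversion can be done in Python. This is a minimal sketch: `convert_to_utf8` is a hypothetical helper, and the source encoding has to be known in advance, just as with iconv.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="ISO-8859-1"):
    """Re-encode a text file as UTF-8, leaving line endings unchanged."""
    # Working with bytes avoids any newline translation by the platform.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling `convert_to_utf8("input.txt", "input_utf8.txt")` mirrors the iconv command above.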

Programmatic access

You can use the extraction functions directly in Python:

from soak.document_utils import extract_text, get_supported_extensions

# Check supported formats
print(get_supported_extensions())
# ['.docx', '.rtf', '.txt', '.md', '.markdown', '.pdf', '.csv', '.xlsx', '.xls']

# Extract text from a document
text = extract_text("interview.docx")
print(text)  # Markdown string

# Spreadsheets return structured data
rows = extract_text("survey.csv")
print(rows)  # List of dicts, one per row

Format-specific notes

Word (.docx)

  • Comments and tracked changes are ignored
  • Document metadata is stripped
  • Headers/footers are included in extraction
  • Embedded images are ignored (text only)

RTF

  • Structural elements preserved where possible
  • Complex formatting may be simplified
  • Older RTF variants may have reduced fidelity

Plain text (.txt)

  • Treated as Markdown for normalisation
  • No formatting interpretation
  • UTF-8 encoding expected

Markdown (.md, .markdown)

  • Passed through with whitespace normalisation only
  • Frontmatter (YAML headers) preserved
  • Valid GFM output guaranteed

PDF

  • Text extraction only, no OCR
  • Reading order follows PDF structure
  • Multi-column layouts may interleave incorrectly
  • Tables extracted as plain text (no structure)
