Docfold

Turn any document into structured data. Unified Python toolkit for document structuring — one interface, 16 engines, built-in benchmarks.

Read the announcement: Docfold - open-source document processing toolkit

Engine Comparison

The ratings below are estimates based on public benchmarks, documentation, and community reports. See detailed methodology. To compare engines on your own document, run: docfold compare your_doc.pdf

Engine	docfold	Type	License	Text PDF	Scan/OCR	Tables	BBox	Conf	Speed	Cost
Docling	✅	Local	MIT	★★★	★★☆	★★★	✅	—	Medium	Free
MinerU	✅	Local	AGPL	★★★	★★★	★★★	—	—	Slow	Free
Marker	✅	SaaS	Paid	★★★	★★★	★★★	✅	—	Fast	$$
PyMuPDF	✅	Local	AGPL	★★★	☆☆☆	★☆☆	—	—	Ultra	Free
PaddleOCR	✅	Local	Apache	★☆☆	★★★	★★☆	—	✅	Medium	Free
Tesseract	✅	Local	Apache	★☆☆	★★☆	★☆☆	—	—	Medium	Free
EasyOCR	✅	Local	Apache	★☆☆	★★★	☆☆☆	—	✅	Medium	Free
Unstructured	✅	Local	Apache	★★☆	★★☆	★★☆	—	—	Medium	Free
LlamaParse	✅	SaaS	Paid	★★★	★★★	★★★	—	—	Fast	$$
Mistral OCR	✅	SaaS	Paid	★★★	★★★	★★★	—	—	Fast	$$
Zerox	✅	VLM	MIT	★★★	★★★	★★☆	—	—	Slow	$$$
AWS Textract	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Google Doc AI	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Azure Doc Intel	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Nougat	✅	Local	MIT	★★★	★★☆	★★☆	—	—	Slow	Free
Surya	✅	Local	GPL	★★☆	★★★	★★☆	✅	✅	Medium	Free

Legend: ★★★ Excellent · ★★☆ Good · ★☆☆ Basic · ☆☆☆ Not supported. Cost: $$ ≈ $1-3 per 1K pages, $$$ ≈ $5-15 per 1K pages. BBox = bounding boxes, Conf = confidence scores.
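To make the cost tiers concrete, the sketch below multiplies a page count by the per-1K-page ranges quoted in the legend. The ranges are the ones given above; the `estimate_cost` helper and the `COST_TIERS` table are hypothetical names used only for illustration and are not part of Docfold.

```python
# Per-1K-page price ranges in USD, copied from the legend above.
COST_TIERS = {
    "Free": (0.0, 0.0),
    "$$": (1.0, 3.0),
    "$$$": (5.0, 15.0),
}


def estimate_cost(pages: int, tier: str) -> tuple:
    """Return the (low, high) estimated cost in USD for `pages` pages."""
    low_per_1k, high_per_1k = COST_TIERS[tier]
    return (pages / 1000 * low_per_1k, pages / 1000 * high_per_1k)


low, high = estimate_cost(25_000, "$$")
print(f"25,000 pages at $$: roughly ${low:.0f} to ${high:.0f}")
```

Actual vendor pricing varies by plan and feature set, so treat the result as an order-of-magnitude figure.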

Full engine profiles, format matrix, hardware requirements, and cost breakdown →

How to Choose

Your situation	Recommended engine
Digital PDF, speed is critical	PyMuPDF — zero deps, ~1000 pages/sec
Scanned documents, need OCR	PaddleOCR (80+ langs), EasyOCR (PyTorch), or Tesseract (100+ langs)
Complex layouts + tables	Docling or MinerU (free), LlamaParse (paid)
Academic papers + math formulas	MinerU or Nougat (free), Mistral OCR (paid)
Best quality, budget available	Mistral OCR or LlamaParse
Use any Vision LLM (GPT-4o, Claude, etc.)	Zerox — model-agnostic
Self-hosted, all-in-one ETL	Unstructured with hi_res strategy
Diverse file types (not just PDF)	Docling or Unstructured
Need bounding boxes + confidence	Textract, Google DocAI, or Azure DocInt (paid); Surya (free, local)
Office files (DOCX/PPTX/XLSX)	Docling, Marker, Unstructured, or Azure DocInt
AWS/GCP/Azure native pipeline	Textract / Google DocAI / Azure DocInt

Why Docfold?

Every engine has trade-offs. Docfold lets you switch between them with one line:

Challenge	Without Docfold	With Docfold
Try a new engine	Rewrite your pipeline	Change one string: `engine_hint="docling"`
Compare quality	Manual side-by-side	`router.compare("doc.pdf")` — one line
Batch 1000 files	Build your own concurrency	`router.process_batch(files, concurrency=5)`
Measure accuracy	Write custom metrics	Built-in CER, WER, Table F1, Reading Order
Switch engines later	Major refactor	Zero code changes — same `EngineResult`

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine
from docfold.engines.pymupdf_engine import PyMuPDFEngine


async def main():
    router = EngineRouter([DoclingEngine(), PyMuPDFEngine()])

    # Auto-select the best available engine
    result = await router.process("invoice.pdf")
    print(result.content)       # Markdown output
    print(result.engine_name)   # Which engine was used
    print(result.processing_time_ms)

    # Compare all engines on the same document
    results = await router.compare("invoice.pdf")
    for name, res in results.items():
        print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")


# router.process() and router.compare() are coroutines, so run them in an event loop
asyncio.run(main())
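Because `compare()` yields a mapping from engine name to result, ordinary dictionary tools are enough to rank engines. The sketch below sorts by processing time. It uses stand-in objects that carry only the two fields read in the loop above (`content` and `processing_time_ms`); the `rank_by_speed` helper and the sample numbers are made up for illustration.

```python
from types import SimpleNamespace


def rank_by_speed(results):
    """Return engine names sorted from fastest to slowest."""
    return sorted(results, key=lambda name: results[name].processing_time_ms)


# Stand-ins for EngineResult objects; the values are invented for this example.
sample = {
    "docling": SimpleNamespace(content="# Invoice\n...", processing_time_ms=1800),
    "pymupdf": SimpleNamespace(content="Invoice ...", processing_time_ms=40),
}

for name in rank_by_speed(sample):
    res = sample[name]
    print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")
```

Speed alone says nothing about extraction quality, so a ranking like this is best read alongside the evaluation metrics.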

Supported Engines

Engine	Type	License	Formats	GPU	Install
Docling	Local	MIT	PDF, DOCX, PPTX, XLSX, HTML, images	No	`pip install docfold[docling]`
MinerU	Local	AGPL-3.0	PDF	Recommended	`pip install docfold[mineru]`
Marker API	SaaS	Paid	PDF, Office, images	N/A	`pip install docfold[marker]`
PyMuPDF	Local	AGPL-3.0	PDF	No	`pip install docfold[pymupdf]`
PaddleOCR	Local	Apache-2.0	Images, scanned PDFs	Optional	`pip install docfold[paddleocr]`
Tesseract	Local	Apache-2.0	Images, scanned PDFs	No	`pip install docfold[tesseract]`
EasyOCR	Local	Apache-2.0	Images, scanned PDFs	Optional	`pip install docfold[easyocr]`
Unstructured	Local	Apache-2.0	PDF, Office, HTML, email, ePub	Optional	`pip install docfold[unstructured]`
LlamaParse	SaaS	Paid	PDF, Office, images	N/A	`pip install docfold[llamaparse]`
Mistral OCR	SaaS	Paid	PDF, images	N/A	`pip install docfold[mistral-ocr]`
Zerox	VLM	MIT	PDF, images	Depends	`pip install docfold[zerox]`
AWS Textract	SaaS	Paid	PDF, images	N/A	`pip install docfold[textract]`
Google Doc AI	SaaS	Paid	PDF, images	N/A	`pip install docfold[google-docai]`
Azure Doc Intel	SaaS	Paid	PDF, Office, HTML, images	N/A	`pip install docfold[azure-docint]`
Nougat	Local	MIT (code)	PDF	Recommended	`pip install docfold[nougat]`
Surya	Local	GPL-3.0	PDF, images	Optional	`pip install docfold[surya]`

Adding your own engine? Implement the DocumentEngine interface — see Adding a Custom Engine below.

Installation

# Core only (no engines — useful for writing custom adapters)
pip install docfold

# With specific engines
pip install docfold[docling]
pip install docfold[docling,pymupdf,tesseract]

# Everything
pip install docfold[all]

Requires Python 3.10+.

CLI

# Convert a document
docfold convert invoice.pdf
docfold convert report.pdf --engine docling --format html --output report.html

# List available engines
docfold engines

# Compare engines on a document
docfold compare invoice.pdf

# Run evaluation benchmark
docfold evaluate tests/evaluation/dataset/ --output report.json

Batch Processing

Process hundreds of documents with bounded concurrency and progress tracking:

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine


# Progress callback: receives keyword-only arguments for each processed file
def on_progress(*, current, total, file_path, engine_name, status, **_):
    print(f"[{current}/{total}] {status}: {file_path} ({engine_name})")


async def main():
    router = EngineRouter([DoclingEngine()])
    file_paths = ["invoice1.pdf", "invoice2.pdf", "report.docx"]

    # Simple batch
    batch = await router.process_batch(file_paths, concurrency=3)
    print(f"{batch.succeeded}/{batch.total} succeeded in {batch.total_time_ms}ms")

    # With progress callback
    batch = await router.process_batch(
        file_paths,
        concurrency=5,
        on_progress=on_progress,
    )

    # Access results
    for path, result in batch.results.items():
        print(f"{path}: {len(result.content)} chars")

    # Check errors
    for path, error in batch.errors.items():
        print(f"FAILED {path}: {error}")


asyncio.run(main())
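For readers curious what bounded concurrency means in practice, one common asyncio pattern is a semaphore that caps the number of in-flight tasks. The sketch below is not Docfold's implementation. `run_bounded` and `fake_process` are hypothetical names, and the worker merely sleeps instead of parsing a document.

```python
import asyncio


async def run_bounded(items, worker, concurrency=3):
    """Run `worker(item)` for every item with at most `concurrency` in flight.

    Returns (results, errors): two dicts keyed by item.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results, errors = {}, {}

    async def guarded(item):
        async with semaphore:
            try:
                results[item] = await worker(item)
            except Exception as exc:  # record the failure, keep the batch going
                errors[item] = str(exc)

    await asyncio.gather(*(guarded(item) for item in items))
    return results, errors


async def fake_process(path):
    """Hypothetical stand-in for a real engine call."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return f"content of {path}"


results, errors = asyncio.run(
    run_bounded(["a.pdf", "b.pdf", "c.bad"], fake_process, concurrency=2)
)
print(f"{len(results)}/{len(results) + len(errors)} succeeded")
```

Collecting errors per file, rather than letting the first exception cancel the batch, mirrors the `batch.results` / `batch.errors` split shown above.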

Unified Result Format

Every engine returns the same EngineResult dataclass:

@dataclass
class EngineResult:
    content: str              # The extracted text (markdown/html/json/text)
    format: OutputFormat      # markdown | html | json | text
    engine_name: str          # Which engine produced this
    metadata: dict            # Engine-specific metadata
    pages: int | None         # Number of pages processed
    images: dict | None       # Extracted images {filename: base64}
    tables: list | None       # Extracted tables
    bounding_boxes: list | None  # Layout element positions
    confidence: float | None  # Overall confidence [0-1]
    processing_time_ms: int   # Wall-clock time
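Several of these fields are optional, so code that consumes results should tolerate `None`. The sketch below shows one way to do that. `ResultStandIn` is a local stand-in that mirrors the field list above so the example runs without Docfold installed; its default values are assumptions, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultStandIn:
    """Local stand-in mirroring the EngineResult fields listed above.

    The default values are assumptions made for this sketch.
    """
    content: str
    format: str
    engine_name: str
    metadata: dict = field(default_factory=dict)
    pages: Optional[int] = None
    images: Optional[dict] = None
    tables: Optional[list] = None
    bounding_boxes: Optional[list] = None
    confidence: Optional[float] = None
    processing_time_ms: int = 0


def summarize(result) -> str:
    """Build a one-line summary, skipping fields the engine left as None."""
    parts = [f"{result.engine_name}: {len(result.content)} chars"]
    if result.pages is not None:
        parts.append(f"{result.pages} pages")
    if result.tables:
        parts.append(f"{len(result.tables)} tables")
    if result.confidence is not None:
        parts.append(f"confidence {result.confidence:.2f}")
    return ", ".join(parts)


print(summarize(ResultStandIn("# Title", "markdown", "pymupdf", pages=2)))
print(summarize(ResultStandIn("Total: 42", "text", "textract", confidence=0.97)))
```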

Evaluation Framework

Docfold includes a built-in evaluation harness to objectively compare engines:

pip install docfold[evaluation]
docfold evaluate path/to/dataset/ --engines docling,pymupdf,marker

Metrics measured:

Metric	What it measures	Target
CER (Character Error Rate)	Character-level text accuracy	< 0.05
WER (Word Error Rate)	Word-level text accuracy	< 0.10
Table F1	Table detection and cell content accuracy	> 0.85
Heading F1	Heading detection precision/recall	> 0.90
Reading Order Score	Correctness of reading order (Kendall's tau)	> 0.90

See docs/evaluation.md for the ground truth JSON schema and detailed usage.
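For intuition about the text and reading-order metrics, here is a minimal sketch using their textbook definitions: CER and WER as Levenshtein edit distance divided by reference length, and Kendall's tau computed from the number of out-of-order pairs. Docfold's own implementation may differ in normalization or tokenization; all function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def kendall_tau(predicted_order):
    """Kendall's tau between a predicted order and the true order 0..n-1."""
    n = len(predicted_order)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if predicted_order[i] > predicted_order[j]
    )
    return 1.0 - 2.0 * discordant / pairs


print(f"CER = {cer('invoice total', 'invoice tota1'):.3f}")
print(f"WER = {wer('invoice total due', 'invoice tota1 due'):.3f}")
print(f"tau = {kendall_tau([0, 1, 3, 2]):.3f}")
```

A single misread character costs far more in WER than in CER, which is why the two targets in the table differ.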

Architecture

                        ┌─────────────────────────────┐
                        │       Your Application      │
                        └──────────┬──────────────────┘
                                   │
                        ┌──────────▼──────────────────┐
                        │       EngineRouter          │
                        │  select() / process()       │
                        │ process_batch() / compare() │
                        └──────────┬──────────────────┘
                                   │
     ┌──────────┬───────┬──────────┴──────┬──────────┬──────────┐
     ▼          ▼       ▼                 ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌──────────┐  ┌────────┐ ┌────────┐ ┌──────┐
│Docling │ │ MinerU │ │Unstructd │  │ Marker │ │PyMuPDF │ │ OCR  │
│(local) │ │(local) │ │ (local)  │  │ (SaaS) │ │(local) │ │Paddle│
└────────┘ └────────┘ └──────────┘  └────────┘ └────────┘ │Tess. │
     │          │           │            │          │      └──────┘
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Llama   │ │ Mistral  │ │ Zerox  │      │          │
     │     │Parse   │ │  OCR     │ │ (VLM)  │      │          │
     │     │(SaaS)  │ │ (SaaS)   │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     │          │           │            │          │          │
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Textract│ │Google    │ │ Azure  │      │          │
     │     │ (AWS)  │ │DocAI     │ │DocInt  │      │          │
     │     │        │ │ (GCP)    │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     └──────────┴───────┴─────────────┴─────────────┴──────────┘
                                   │
                          ┌────────▼───────┐
                          │  EngineResult  │
                          │  (unified)     │
                          └────────────────┘

Engine Selection Logic

The router decides which engine to use in the following order of precedence:

1. Explicit hint: engine_hint="docling" in the call
2. Environment default: the ENGINE_DEFAULT=docling environment variable
3. Extension-aware priority: each file type has its own engine priority chain (e.g., .png prefers PaddleOCR, .pdf prefers Docling, .docx skips PDF-only engines)
4. User configuration: override the chain with fallback_order or restrict the candidates with allowed_engines

# Restrict to specific engines
router = EngineRouter(engines, allowed_engines={"docling", "pymupdf"})

# Custom fallback order
router = EngineRouter(engines, fallback_order=["pymupdf", "docling", "marker"])

# CLI: --engines flag
# docfold convert invoice.pdf --engines docling,pymupdf
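The resolution order can be sketched as a plain function. This is an illustration of the rules described in this section, not the router's actual code. `select_engine` is a hypothetical name, the priority chains are placeholders apart from the first choices for .png and .pdf, and letting an unusable hint fall through to the next rule is an assumption.

```python
import os
from pathlib import Path

# Hypothetical priority chains. Only the first choices for .png and .pdf come
# from the description above; the remaining entries are placeholders.
EXTENSION_PRIORITY = {
    ".png": ["paddleocr", "tesseract"],
    ".pdf": ["docling", "pymupdf"],
    ".docx": ["docling", "unstructured"],
}


def select_engine(file_path, available, engine_hint=None,
                  allowed_engines=None, fallback_order=None):
    """Pick an engine name following the precedence rules sketched above."""
    candidates = set(available)
    if allowed_engines is not None:
        candidates &= set(allowed_engines)

    # An explicit hint wins when that engine is usable.
    if engine_hint in candidates:
        return engine_hint

    # Next, the ENGINE_DEFAULT environment variable.
    env_default = os.environ.get("ENGINE_DEFAULT")
    if env_default in candidates:
        return env_default

    # Finally, a user-supplied fallback order, else the per-extension chain.
    suffix = Path(file_path).suffix.lower()
    chain = fallback_order or EXTENSION_PRIORITY.get(suffix, [])
    for name in chain:
        if name in candidates:
            return name
    return None


available = {"docling", "pymupdf", "paddleocr"}
print(select_engine("scan.png", available))
print(select_engine("report.pdf", available, engine_hint="pymupdf"))
print(select_engine("report.pdf", available, allowed_engines={"pymupdf"}))
```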

Adding a Custom Engine

Implement the DocumentEngine interface:

from docfold.engines.base import DocumentEngine, EngineResult, OutputFormat

class MyEngine(DocumentEngine):
    @property
    def name(self) -> str:
        return "my_engine"

    @property
    def supported_extensions(self) -> set[str]:
        return {"pdf", "docx"}

    def is_available(self) -> bool:
        try:
            import my_library
            return True
        except ImportError:
            return False

    async def process(self, file_path, output_format=OutputFormat.MARKDOWN, **kwargs):
        # Your extraction logic here; extract() is a placeholder for your own function
        content = extract(file_path)
        return EngineResult(
            content=content,
            format=output_format,
            engine_name=self.name,
        )

# Register it
router.register(MyEngine())
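The `is_available()` method above imports the backend just to learn whether it is installed. A possible variation, shown here as a sketch rather than something Docfold requires, is to ask the import system with `importlib.util.find_spec`, which avoids running the package's import-time code. The `module_available` helper is a hypothetical name.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(module_available("json"))        # standard library module
print(module_available("my_library"))  # placeholder name from the example above
```

Inside a custom engine, `is_available()` could then simply return `module_available("my_library")`.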

Related Projects

Docfold builds on and integrates with these excellent projects:

Project	Description
Docling	IBM's document conversion toolkit — PDF, DOCX, PPTX, and more
MinerU / PDF-Extract-Kit	End-to-end PDF structuring with layout analysis and formula recognition
Marker	High-quality PDF to Markdown converter
PyMuPDF	Fast PDF/XPS/EPUB processing library
PaddleOCR	Multilingual OCR toolkit (80+ languages)
Tesseract	Open-source OCR engine (100+ languages)
Unstructured	ETL toolkit for diverse document types
LlamaParse	LLM-powered document parsing
Mistral OCR	Vision LLM document understanding
Zerox	Model-agnostic Vision LLM OCR
Nougat	Meta's academic PDF to Markdown model
Surya	Multilingual OCR + layout analysis

Built by

Project	Description
Datatera.ai	AI-powered data transformation and document processing platform
Orquesta AI	AI orchestration and agent management platform
AI Agent Labs	AI agent services and location-based intelligence

Development

git clone https://github.com/mihailorama/docfold.git
cd docfold
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src/ tests/
mypy src/

See CONTRIBUTING.md for detailed guidelines.

License

MIT. See LICENSE.

Note: Some engine backends have their own licenses (AGPL-3.0 for PyMuPDF and MinerU, GPL-3.0 for Surya, SaaS terms for Marker/LlamaParse/Mistral). Docfold itself is MIT — the engine adapters are optional extras that you install separately.
