Turn any document into structured data. Unified Python toolkit for document structuring — one interface, 16 engines, built-in benchmarks.
Read the announcement: Docfold - open-source document processing toolkit
The ratings below are research-based estimates from public benchmarks, documentation, and community reports. See detailed methodology. Run your own comparison:

```bash
docfold compare your_doc.pdf
```

| Engine | docfold | Type | License | Text PDF | Scan/OCR | Tables | BBox | Conf | Speed | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Docling | ✅ | Local | MIT | ★★★ | ★★☆ | ★★★ | ✅ | — | Medium | Free |
| MinerU | ✅ | Local | AGPL | ★★★ | ★★★ | ★★★ | — | — | Slow | Free |
| Marker | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | — | Fast | $$ |
| PyMuPDF | ✅ | Local | AGPL | ★★★ | ☆☆☆ | ★☆☆ | — | — | Ultra | Free |
| PaddleOCR | ✅ | Local | Apache | ★☆☆ | ★★★ | ★★☆ | — | ✅ | Medium | Free |
| Tesseract | ✅ | Local | Apache | ★☆☆ | ★★☆ | ★☆☆ | — | — | Medium | Free |
| EasyOCR | ✅ | Local | Apache | ★☆☆ | ★★★ | ☆☆☆ | — | ✅ | Medium | Free |
| Unstructured | ✅ | Local | Apache | ★★☆ | ★★☆ | ★★☆ | — | — | Medium | Free |
| LlamaParse | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | — | — | Fast | $$ |
| Mistral OCR | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | — | — | Fast | $$ |
| Zerox | ✅ | VLM | MIT | ★★★ | ★★★ | ★★☆ | — | — | Slow | $$$ |
| AWS Textract | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Google Doc AI | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Azure Doc Intel | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Nougat | ✅ | Local | MIT | ★★★ | ★★☆ | ★★☆ | — | — | Slow | Free |
| Surya | ✅ | Local | GPL | ★★☆ | ★★★ | ★★☆ | ✅ | ✅ | Medium | Free |
★★★ Excellent ★★☆ Good ★☆☆ Basic ☆☆☆ Not supported — $$ ~$1-3/1K pages $$$ ~$5-15/1K pages — BBox Bounding boxes — Conf Confidence scores
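
As a rough illustration of the price bands in the legend, the sketch below turns a page count into a low/high cost range. The `PRICE_BANDS` table and the `estimate_cost` helper are hypothetical names, not part of docfold; the figures are the approximate per-1K-page ranges quoted above, so actual vendor pricing may differ.

```python
# Approximate USD per 1,000 pages, copied from the legend above.
# "Free" covers the local engines, which have no per-page fee.
PRICE_BANDS = {
    "Free": (0.0, 0.0),
    "$$": (1.0, 3.0),
    "$$$": (5.0, 15.0),
}


def estimate_cost(pages: int, band: str) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for a page count."""
    low, high = PRICE_BANDS[band]
    thousands = pages / 1000
    return thousands * low, thousands * high


if __name__ == "__main__":
    print(estimate_cost(50_000, "$$"))   # (50.0, 150.0)
    print(estimate_cost(50_000, "$$$"))  # (250.0, 750.0)
```
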
Full engine profiles, format matrix, hardware requirements, and cost breakdown →
| Your situation | Recommended engine |
|---|---|
| Digital PDF, speed is critical | PyMuPDF — zero deps, ~1000 pages/sec |
| Scanned documents, need OCR | PaddleOCR (80+ langs), EasyOCR (PyTorch), or Tesseract (100+ langs) |
| Complex layouts + tables | Docling or MinerU (free), LlamaParse (paid) |
| Academic papers + math formulas | MinerU or Nougat (free), Mistral OCR (paid) |
| Best quality, budget available | Mistral OCR or LlamaParse |
| Use any Vision LLM (GPT-4o, Claude, etc.) | Zerox — model-agnostic |
| Self-hosted, all-in-one ETL | Unstructured with hi_res strategy |
| Diverse file types (not just PDF) | Docling or Unstructured |
| Need bounding boxes + confidence | Textract, Google DocAI, or Azure DocInt |
| Office files (DOCX/PPTX/XLSX) | Docling, Marker, Unstructured, or Azure DocInt |
| AWS/GCP/Azure native pipeline | Textract / Google DocAI / Azure DocInt |
Every engine has trade-offs. Docfold lets you switch between them with one line:
| Challenge | Without Docfold | With Docfold |
|---|---|---|
| Try a new engine | Rewrite your pipeline | Change one string: engine_hint="docling" |
| Compare quality | Manual side-by-side | router.compare("doc.pdf") — one line |
| Batch 1000 files | Build your own concurrency | router.process_batch(files, concurrency=5) |
| Measure accuracy | Write custom metrics | Built-in CER, WER, Table F1, Reading Order |
| Switch engines later | Major refactor | Zero code changes — same EngineResult |
```python
import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine
from docfold.engines.pymupdf_engine import PyMuPDFEngine


async def main():
    router = EngineRouter([DoclingEngine(), PyMuPDFEngine()])

    # Auto-select the best available engine
    result = await router.process("invoice.pdf")
    print(result.content)             # Markdown output
    print(result.engine_name)         # Which engine was used
    print(result.processing_time_ms)  # Wall-clock time in milliseconds

    # Compare all engines on the same document
    results = await router.compare("invoice.pdf")
    for name, res in results.items():
        print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")


# process() and compare() are awaited, so they need a running event loop
asyncio.run(main())
```

| Engine | Type | License | Formats | GPU | Install |
|---|---|---|---|---|---|
| Docling | Local | MIT | PDF, DOCX, PPTX, XLSX, HTML, images | No | pip install docfold[docling] |
| MinerU | Local | AGPL-3.0 | PDF | Recommended | pip install docfold[mineru] |
| Marker API | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[marker] |
| PyMuPDF | Local | AGPL-3.0 | PDF | No | pip install docfold[pymupdf] |
| PaddleOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[paddleocr] |
| Tesseract | Local | Apache-2.0 | Images, scanned PDFs | No | pip install docfold[tesseract] |
| EasyOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[easyocr] |
| Unstructured | Local | Apache-2.0 | PDF, Office, HTML, email, ePub | Optional | pip install docfold[unstructured] |
| LlamaParse | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[llamaparse] |
| Mistral OCR | SaaS | Paid | PDF, images | N/A | pip install docfold[mistral-ocr] |
| Zerox | VLM | MIT | PDF, images | Depends | pip install docfold[zerox] |
| AWS Textract | SaaS | Paid | PDF, images | N/A | pip install docfold[textract] |
| Google Doc AI | SaaS | Paid | PDF, images | N/A | pip install docfold[google-docai] |
| Azure Doc Intel | SaaS | Paid | PDF, Office, HTML, images | N/A | pip install docfold[azure-docint] |
| Nougat | Local | MIT (code) | PDF | Recommended | pip install docfold[nougat] |
| Surya | Local | GPL-3.0 | PDF, images | Optional | pip install docfold[surya] |
Adding your own engine? Implement the `DocumentEngine` interface; see Adding a Custom Engine below.
```bash
# Core only (no engines; useful for writing custom adapters)
pip install docfold

# With specific engines
pip install docfold[docling]
pip install docfold[docling,pymupdf,tesseract]

# Everything
pip install docfold[all]
```

Requires Python 3.10+.
```bash
# Convert a document
docfold convert invoice.pdf
docfold convert report.pdf --engine docling --format html --output report.html

# List available engines
docfold engines

# Compare engines on a document
docfold compare invoice.pdf

# Run evaluation benchmark
docfold evaluate tests/evaluation/dataset/ --output report.json
```

Process hundreds of documents with bounded concurrency and progress tracking:
```python
import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine


def on_progress(*, current, total, file_path, engine_name, status, **_):
    # Called once per file; extra keyword arguments are ignored
    print(f"[{current}/{total}] {status}: {file_path} ({engine_name})")


async def main():
    router = EngineRouter([DoclingEngine()])
    file_paths = ["invoice1.pdf", "invoice2.pdf", "report.docx"]

    # Simple batch
    batch = await router.process_batch(file_paths, concurrency=3)
    print(f"{batch.succeeded}/{batch.total} succeeded in {batch.total_time_ms}ms")

    # With progress callback
    batch = await router.process_batch(
        file_paths,
        concurrency=5,
        on_progress=on_progress,
    )

    # Access results
    for path, result in batch.results.items():
        print(f"{path}: {len(result.content)} chars")

    # Check errors
    for path, error in batch.errors.items():
        print(f"FAILED {path}: {error}")


asyncio.run(main())
```

Every engine returns the same `EngineResult` dataclass:
```python
@dataclass
class EngineResult:
    content: str                 # The extracted text (markdown/html/json/text)
    format: OutputFormat         # markdown | html | json | text
    engine_name: str             # Which engine produced this
    metadata: dict               # Engine-specific metadata
    pages: int | None            # Number of pages processed
    images: dict | None          # Extracted images {filename: base64}
    tables: list | None          # Extracted tables
    bounding_boxes: list | None  # Layout element positions
    confidence: float | None     # Overall confidence [0-1]
    processing_time_ms: int      # Wall-clock time
```

Docfold includes a built-in evaluation harness to objectively compare engines:
```bash
pip install docfold[evaluation]
docfold evaluate path/to/dataset/ --engines docling,pymupdf,marker
```

Metrics measured:
| Metric | What it measures | Target |
|---|---|---|
| CER (Character Error Rate) | Character-level text accuracy | < 0.05 |
| WER (Word Error Rate) | Word-level text accuracy | < 0.10 |
| Table F1 | Table detection and cell content accuracy | > 0.85 |
| Heading F1 | Heading detection precision/recall | > 0.90 |
| Reading Order Score | Correctness of reading order (Kendall's tau) | > 0.90 |
See docs/evaluation.md for the ground truth JSON schema and detailed usage.
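
For readers unfamiliar with the metrics in the table, here is a minimal sketch of their textbook definitions. It is not docfold's implementation: the function names are illustrative, and the harness may normalize text or rescale scores differently. CER and WER are edit distance divided by reference length. Kendall's tau compares a predicted block order against the true order and ranges from -1 to 1.

```python
from itertools import combinations


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free when equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def kendall_tau(predicted_order: list[int]) -> float:
    """Kendall's tau against the true order 0..n-1 (a permutation, no ties).

    predicted_order[i] is the true index of the block emitted i-th.
    """
    pairs = list(combinations(predicted_order, 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if a < b)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))        # 0.5 (3 edits / 6 characters)
    print(kendall_tau([0, 1, 2, 3]))       # 1.0 (perfect reading order)
```

Lower is better for CER and WER; higher is better for tau.
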
```text
                 ┌─────────────────────────────┐
                 │      Your Application       │
                 └──────────────┬──────────────┘
                                │
                 ┌──────────────▼──────────────┐
                 │        EngineRouter         │
                 │    select() / process()     │
                 │ process_batch() / compare() │
                 └──────────────┬──────────────┘
                                │
     ┌──────────┬───────────┬───┴───────┬──────────┬─────────┐
     ▼          ▼           ▼           ▼          ▼         ▼
┌────────┐ ┌────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌──────┐
│Docling │ │ MinerU │ │Unstructd │ │ Marker │ │PyMuPDF │ │ OCR  │
│(local) │ │(local) │ │ (local)  │ │ (SaaS) │ │(local) │ │Paddle│
└────────┘ └────────┘ └──────────┘ └────────┘ └────────┘ │Tess. │
     │          │           │           │          │     └──────┘
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │         │
     │     │Llama   │ │ Mistral  │ │ Zerox  │      │         │
     │     │Parse   │ │   OCR    │ │ (VLM)  │      │         │
     │     │(SaaS)  │ │  (SaaS)  │ │        │      │         │
     │     └────────┘ └──────────┘ └────────┘      │         │
     │          │           │           │          │         │
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │         │
     │     │Textract│ │ Google   │ │ Azure  │      │         │
     │     │ (AWS)  │ │ DocAI    │ │ DocInt │      │         │
     │     │        │ │ (GCP)    │ │        │      │         │
     │     └────────┘ └──────────┘ └────────┘      │         │
     │          │           │           │          │         │
     └──────────┴───────────┴───┬───────┴──────────┴─────────┘
                                │
                        ┌───────▼───────┐
                        │ EngineResult  │
                        │   (unified)   │
                        └───────────────┘
```
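
The shape of the diagram above (one router, interchangeable engines, one result type) can be illustrated with a toy version. This is only a sketch of the pattern, not docfold's code: `ToyEngine`, `ToyRouter`, and `ToyResult` are made-up names, and the semaphore-based batching is one common way to bound concurrency with `asyncio`, assumed here for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyResult:
    content: str
    engine_name: str


class ToyEngine:
    """Stand-in engine: applies a string transform instead of parsing a file."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform

    async def process(self, file_path: str) -> ToyResult:
        await asyncio.sleep(0)  # stand-in for real I/O or model inference
        return ToyResult(content=self.transform(file_path), engine_name=self.name)


class ToyRouter:
    def __init__(self, engines):
        self.engines = {engine.name: engine for engine in engines}

    async def process(self, file_path, engine_hint=None) -> ToyResult:
        # Fall back to the first registered engine when no hint is given
        name = engine_hint or next(iter(self.engines))
        return await self.engines[name].process(file_path)

    async def process_batch(self, file_paths, concurrency=3):
        semaphore = asyncio.Semaphore(concurrency)  # caps in-flight files
        results, errors = {}, {}

        async def worker(path):
            async with semaphore:
                try:
                    results[path] = await self.process(path)
                except Exception as exc:  # record the failure and keep going
                    errors[path] = str(exc)

        await asyncio.gather(*(worker(path) for path in file_paths))
        return results, errors


async def main():
    router = ToyRouter([
        ToyEngine("upper", str.upper),
        ToyEngine("reverse", lambda s: s[::-1]),
    ])
    return await router.process_batch(["a.pdf", "b.pdf"], concurrency=2)


if __name__ == "__main__":
    results, errors = asyncio.run(main())
    print(results, errors)
```

Because every engine returns the same result type, the caller never needs to know which engine handled a given file.
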
When no engine is explicitly specified, the router selects one automatically:

- Explicit hint: `engine_hint="docling"` in the call
- Environment default: `ENGINE_DEFAULT=docling` env var
- Extension-aware priority: each file type has its own engine priority chain (e.g., `.png` prefers PaddleOCR, `.pdf` prefers Docling, `.docx` skips PDF-only engines)
- User-configurable: override with `fallback_order` or restrict with `allowed_engines`

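
The selection order described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the router's actual code: `select_engine` and its parameters are hypothetical, and the sketch assumes that an unavailable or disallowed hint falls through to the next rule rather than raising.

```python
import os


def select_engine(
    extension,
    available,
    priority_by_extension,
    engine_hint=None,
    allowed_engines=None,
    env=os.environ,
):
    """Return the name of the engine to use, or raise if none qualifies."""
    # Apply the allow-list first so every later rule respects it
    candidates = [
        name for name in available
        if allowed_engines is None or name in allowed_engines
    ]
    # 1. An explicit hint wins when that engine is usable
    if engine_hint in candidates:
        return engine_hint
    # 2. Then the environment default
    env_default = env.get("ENGINE_DEFAULT")
    if env_default in candidates:
        return env_default
    # 3. Then the priority chain for this file extension
    for name in priority_by_extension.get(extension, []):
        if name in candidates:
            return name
    raise LookupError(f"no engine available for .{extension}")


if __name__ == "__main__":
    priority = {"png": ["paddleocr", "docling"], "pdf": ["docling", "pymupdf"]}
    installed = ["pymupdf", "docling"]
    print(select_engine("pdf", installed, priority, env={}))  # docling
    print(select_engine("pdf", installed, priority, engine_hint="pymupdf", env={}))  # pymupdf
```
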
```python
# `engines` is a list of engine instances, e.g. [DoclingEngine(), PyMuPDFEngine()]

# Restrict to specific engines
router = EngineRouter(engines, allowed_engines={"docling", "pymupdf"})

# Custom fallback order
router = EngineRouter(engines, fallback_order=["pymupdf", "docling", "marker"])

# CLI: --engines flag
# docfold convert invoice.pdf --engines docling,pymupdf
```

Implement the `DocumentEngine` interface:
```python
from docfold.engines.base import DocumentEngine, EngineResult, OutputFormat


class MyEngine(DocumentEngine):
    @property
    def name(self) -> str:
        return "my_engine"

    @property
    def supported_extensions(self) -> set[str]:
        return {"pdf", "docx"}

    def is_available(self) -> bool:
        # Available only if the backing library can be imported
        try:
            import my_library  # noqa: F401
            return True
        except ImportError:
            return False

    async def process(self, file_path, output_format=OutputFormat.MARKDOWN, **kwargs):
        # Your extraction logic here; extract() is a placeholder for it
        content = extract(file_path)
        return EngineResult(
            content=content,
            format=output_format,
            engine_name=self.name,
        )


# Register it with an existing router
router.register(MyEngine())
```

Docfold builds on and integrates with these excellent projects:
| Project | Description |
|---|---|
| Docling | IBM's document conversion toolkit — PDF, DOCX, PPTX, and more |
| MinerU / PDF-Extract-Kit | End-to-end PDF structuring with layout analysis and formula recognition |
| Marker | High-quality PDF to Markdown converter |
| PyMuPDF | Fast PDF/XPS/EPUB processing library |
| PaddleOCR | Multilingual OCR toolkit (80+ languages) |
| Tesseract | Open-source OCR engine (100+ languages) |
| Unstructured | ETL toolkit for diverse document types |
| LlamaParse | LLM-powered document parsing |
| Mistral OCR | Vision LLM document understanding |
| Zerox | Model-agnostic Vision LLM OCR |
| Nougat | Meta's academic PDF to Markdown model |
| Surya | Multilingual OCR + layout analysis |
| Project | Description |
|---|---|
| Datatera.ai | AI-powered data transformation and document processing platform |
| Orquesta AI | AI orchestration and agent management platform |
| AI Agent Labs | AI agent services and location-based intelligence |
```bash
git clone https://github.com/mihailorama/docfold.git
cd docfold
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src/ tests/
mypy src/
```

See CONTRIBUTING.md for detailed guidelines.
MIT. See LICENSE.
Note: Some engine backends have their own licenses (AGPL-3.0 for PyMuPDF and MinerU, GPL-3.0 for Surya, SaaS terms for Marker/LlamaParse/Mistral). Docfold itself is MIT — the engine adapters are optional extras that you install separately.