Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models with a type-safe TypeScript SDK.
- Best-in-class OCR - PaddleOCR-VL-0.9B for accurate text extraction
- Structured extraction - Define schemas, get JSON back
- Built for scale - Queue-based processing for thousands of documents
- Type-safe SDK - Full TypeScript support with React hooks
- Real-time updates - WebSocket notifications for job progress
- Self-hostable - Run on your own infrastructure
npm install ocrbase
import { createClient } from "ocrbase";
const { parse, extract } = createClient({
  baseUrl: "https://your-instance.com",
  apiKey: "ak_xxx",
});
// Parse document to markdown
const job = await parse({ file: document });
console.log(job.markdownResult);
// Extract structured data
const extractJob = await extract({
  file: invoice,
  hints: "invoice number, date, total, line items",
});
console.log(extractJob.jsonResult);
See SDK documentation for React hooks and advanced usage.
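If `jsonResult` arrives as untyped JSON, it can help to narrow it to a known shape before using it. The sketch below is a plain TypeScript type guard for the invoice example. The `Invoice` and `LineItem` interfaces and their field names are assumptions made for illustration. They are not part of the ocrbase SDK, and the actual keys returned will depend on your hints or schema.

```typescript
// Hypothetical shapes for the invoice example; field names are assumptions,
// not something the ocrbase SDK guarantees.
interface LineItem {
  description: string;
  amount: number;
}

interface Invoice {
  invoiceNumber: string;
  date: string;
  total: number;
  lineItems: LineItem[];
}

// True for plain (non-null, non-array) objects.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isLineItem(value: unknown): value is LineItem {
  return (
    isRecord(value) &&
    typeof value.description === "string" &&
    typeof value.amount === "number"
  );
}

// Narrows an unknown value to Invoice by checking every required field.
function isInvoice(value: unknown): value is Invoice {
  return (
    isRecord(value) &&
    typeof value.invoiceNumber === "string" &&
    typeof value.date === "string" &&
    typeof value.total === "number" &&
    Array.isArray(value.lineItems) &&
    value.lineItems.every(isLineItem)
  );
}

// Usage with a stand-in value; in practice this would be the extracted result.
const candidate: unknown = {
  invoiceNumber: "INV-001",
  date: "2024-01-31",
  total: 120,
  lineItems: [{ description: "Widget", amount: 120 }],
};

if (isInvoice(candidate)) {
  // Inside this branch, candidate is typed as Invoice.
  console.log(candidate.invoiceNumber, candidate.total);
} else {
  console.warn("Extraction result did not match the expected invoice shape");
}
```

A hand-written guard keeps the example dependency-free. A schema validation library would serve the same purpose if the project already uses one.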
See Self-Hosting Guide for deployment instructions.
Requirements: Docker, Bun
MIT - See LICENSE for details.
For API access, on-premise deployment, or questions: adammajcher20@gmail.com