📄 PDF -> .MD/.JSON API & SDK for PaddleOCR-VL with structured data extraction. Self-hostable.

majcheradam/ocrbase

ocrbase

Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models with a type-safe TypeScript SDK.

Features

  • Best-in-class OCR - PaddleOCR-VL-0.9B for accurate text extraction
  • Structured extraction - Define schemas, get JSON back
  • Built for scale - Queue-based processing for thousands of documents
  • Type-safe SDK - Full TypeScript support with React hooks
  • Real-time updates - WebSocket notifications for job progress
  • Self-hostable - Run on your own infrastructure
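The queueing described above happens on the server. When submitting a large batch from the client, it can still help to cap how many requests are in flight at once. The following is a minimal sketch of such a cap; `mapWithConcurrency` is a hypothetical helper written for this illustration and is not part of the ocrbase SDK.

```typescript
// Hypothetical helper (not part of ocrbase): run `worker` over `items`
// with at most `limit` calls in flight, preserving input order in the output.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Start at least one runner, and never more runners than items.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With the `parse` function from Quick Start, a batch could then be submitted as `await mapWithConcurrency(files, 4, (file) => parse({ file }))`, where `files` and the limit of 4 are placeholders to adjust for your instance.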

Quick Start

npm install ocrbase

import { createClient } from "ocrbase";

const { parse, extract } = createClient({
  baseUrl: "https://your-instance.com",
  apiKey: "ak_xxx",
});

// Parse a document to markdown (`document` stands for the file to upload)
const parseJob = await parse({ file: document });
console.log(parseJob.markdownResult);

// Extract structured data, guided by free-text hints (`invoice` is the file to upload)
const extractJob = await extract({
  file: invoice,
  hints: "invoice number, date, total, line items",
});
console.log(extractJob.jsonResult);
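
Before using `jsonResult` in typed code, it may be worth checking its shape at runtime. The sketch below assumes a hypothetical invoice shape; the field names `invoiceNumber`, `date`, `total`, and `lineItems` are illustrative only, and the keys ocrbase actually returns depend on your schema or hints.

```typescript
// Hypothetical shape for the invoice example; adjust to the keys you actually receive.
interface LineItem {
  description: string;
  amount: number;
}

interface Invoice {
  invoiceNumber: string;
  date: string;
  total: number;
  lineItems: LineItem[];
}

// True for plain objects (not null, not arrays).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isLineItem(value: unknown): value is LineItem {
  return (
    isRecord(value) &&
    typeof value["description"] === "string" &&
    typeof value["amount"] === "number"
  );
}

// Runtime check that narrows untyped JSON to the assumed Invoice type.
function isInvoice(value: unknown): value is Invoice {
  if (!isRecord(value)) return false;
  const items = value["lineItems"];
  return (
    typeof value["invoiceNumber"] === "string" &&
    typeof value["date"] === "string" &&
    typeof value["total"] === "number" &&
    Array.isArray(items) &&
    items.every(isLineItem)
  );
}
```

A caller would guard with `if (isInvoice(result)) { ... }` and treat the failing branch as an extraction that needs review rather than as an exception.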

See SDK documentation for React hooks and advanced usage.

Self-Hosting

See Self-Hosting Guide for deployment instructions.

Requirements: Docker, Bun

Architecture

Architecture Diagram

License

MIT - See LICENSE for details.

Contact

For API access, on-premise deployment, or questions: adammajcher20@gmail.com
