A production-ready, modular OCR system for extracting text from PDFs and images using Tesseract OCR.
- PDF to Image Conversion - Convert PDFs to images with configurable DPI
- OCR Text Extraction - Extract text from images with Tesseract
- Confidence Scoring - Get confidence scores and filter low-quality results
- Bounding Boxes - Extract word positions and coordinates
- Auto-Detection - Automatically detect and process PDF or image files
- High Performance - Efficient processing with PIL and optimized workflows
- Fully Tested - 104 tests (52 unit + 52 integration) with pytest
- Modular Design - Clean separation of services and parsers
- Installation
- Quick Start
- Architecture
- Usage Examples
- API Reference
- Testing
- Configuration
- Project Structure
- Python 3.10+
- Tesseract OCR
- Poppler (for PDF support)
- Windows: Download from UB Mannheim's Tesseract
- macOS: brew install tesseract
- Linux: sudo apt-get install tesseract-ocr
- Windows: Download from Poppler for Windows
- macOS: brew install poppler
- Linux: sudo apt-get install poppler-utils
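After installing both system dependencies, it can help to confirm that their executables are visible on PATH before running the pipeline. The following is a minimal sketch, not part of this project: `find_ocr_binaries` is a hypothetical helper name, and `pdftoppm` is used here as a representative Poppler command-line tool.

```python
import shutil


def find_ocr_binaries(names=("tesseract", "pdftoppm")):
    """Map each binary name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of this project.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_ocr_binaries().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a binary is reported as missing, the explicit `tesseract_cmd` and `poppler_path` arguments described under Configuration are the alternative to changing PATH.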
# Clone the repository
git clone <your-repo-url>
cd python-tesseract
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

from src.parsers.ocr_pipeline import OCRPipeline
# Initialize pipeline
pipeline = OCRPipeline(
    tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe',  # Windows
    tessdata_prefix=r'C:\Program Files\Tesseract-OCR\tessdata',  # Windows
    dpi=300,
    lang='eng'
)
# Extract text from image
text = pipeline.process_image("invoice.png")
print(text)
# Extract text from PDF (single page)
text = pipeline.process_pdf("document.pdf", page_number=1)
print(text)
# Auto-detect file type
text = pipeline.process("file.pdf") # or "file.png"
print(text)

# Extract only high-confidence words
result = pipeline.process_image("invoice.png", min_confidence=70.0)
print(f"Total words: {result['total_words']}")
print(f"Average confidence: {result['avg_confidence']:.2f}%")
print(f"Text: {result['full_text']}")
# Access individual words with positions
for word in result['words']:
    print(f"{word['text']} ({word['confidence']}%) at {word['bbox']}")

src/
├── services/
│   └── pdf_service.py               # PDF → Image conversion
├── parsers/
│   ├── ocr_parser.py                # Image → Text extraction
│   └── ocr_pipeline.py              # Main orchestrator
└── __init__.py
tests/
├── parsers/
│   ├── test_ocr_parser.py           # Integration tests
│   ├── test_ocr_parser_unit.py      # Unit tests with mocks
│   ├── test_ocr_pipeline.py         # Integration tests
│   └── test_ocr_pipeline_unit.py    # Unit tests with mocks
└── services/
    ├── test_pdf_service.py          # Integration tests
    └── test_pdf_service_unit.py     # Unit tests with mocks
examples/
├── basic_usage.py                   # Basic examples
├── advanced_usage.py                # Advanced features
└── pdf_conversion.py                # PDF conversion examples
1. PDFToImageService (src/services/pdf_service.py)
- Converts PDFs to images using pdf2image/Poppler
- Configurable DPI and page ranges
- Save images to disk
2. TesseractOCRParser (src/parsers/ocr_parser.py)
- Extracts text from images using Tesseract
- Provides confidence scores and bounding boxes
- Supports multiple extraction modes
3. OCRPipeline (src/parsers/ocr_pipeline.py)
- Main orchestrator combining services
- Unified interface for PDF and image processing
- Auto-detection of file types
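The auto-detection step listed for OCRPipeline could be sketched as a dispatch on the file extension. This is an illustration under assumptions, not the project's actual code: `detect_input_kind` and `IMAGE_EXTENSIONS` are hypothetical names, and the set of image extensions the real pipeline accepts may differ.

```python
from pathlib import Path

# Assumed set of image extensions; the project's actual list may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def detect_input_kind(input_path):
    """Classify a path as 'pdf' or 'image' by its suffix (case-insensitive)."""
    suffix = Path(input_path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
```

An orchestrator built this way would route "pdf" inputs through the PDF-to-image service first and pass "image" inputs straight to the OCR parser.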
from src.parsers.ocr_pipeline import OCRPipeline
pipeline = OCRPipeline()
# Simple text extraction
text = pipeline.process_image("document.png")
# Get detailed OCR data
data = pipeline.process_image("document.png", extract_data=True)
print(f"Text elements: {data['text']}")
print(f"Confidence: {data['conf']}")
print(f"Positions: {list(zip(data['left'], data['top']))}")
# Filter by confidence
result = pipeline.process_image("document.png", min_confidence=80.0)
print(f"High-confidence words: {result['total_words']}")

# Process single page
text = pipeline.process_pdf("document.pdf", page_number=1)
# Process all pages
pages = pipeline.process_pdf("document.pdf")
for i, text in enumerate(pages, 1):
    print(f"Page {i}: {len(text)} characters")
# Process with confidence filtering
result = pipeline.process_pdf(
    "document.pdf",
    page_number=1,
    min_confidence=70.0
)

# PDF to Image conversion
from src.services.pdf_service import PDFToImageService
service = PDFToImageService(dpi=300)
images = service.convert_pdf_to_images("document.pdf")
service.save_images(images, "output", base_name="page", format="PNG")
# OCR extraction
from src.parsers.ocr_parser import TesseractOCRParser
parser = TesseractOCRParser()
text = parser.extract_text("image.png")
data = parser.extract_data("image.png")

Check the examples/ directory for complete working examples:
- basic_usage.py - Basic OCR operations
- advanced_usage.py - Advanced features and confidence filtering
- pdf_conversion.py - PDF to image conversion
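To make the relationship between extract_data=True and min_confidence more concrete, here is a rough sketch of how the column-style data (parallel 'text', 'conf', 'left', 'top', 'width', 'height' lists) could be turned into the confidence-filtered result. `filter_by_confidence` is a hypothetical function, and averaging over only the retained words is an assumption; the project's parser may compute these values differently.

```python
def filter_by_confidence(data, min_confidence):
    """Keep non-empty words whose confidence is at least min_confidence.

    Hypothetical sketch; `data` holds parallel lists keyed by
    'text', 'conf', 'left', 'top', 'width' and 'height'.
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # conf may arrive as int, float or str
        if not text.strip() or conf < min_confidence:
            continue  # skip layout-only entries and low-confidence words
        words.append({
            "text": text,
            "confidence": conf,
            "bbox": {key: data[key][i] for key in ("left", "top", "width", "height")},
        })
    # Assumption: the average is taken over the retained words only.
    avg = sum(w["confidence"] for w in words) / len(words) if words else 0.0
    return {
        "full_text": " ".join(w["text"] for w in words),
        "words": words,
        "avg_confidence": avg,
        "total_words": len(words),
    }


sample = {
    "text": ["Invoice", "", "Total", "b1ur"],
    "conf": [95, -1, 87, 40],
    "left": [10, 0, 50, 90],
    "top": [20, 0, 25, 25],
    "width": [40, 0, 35, 30],
    "height": [15, 0, 15, 15],
}
result = filter_by_confidence(sample, min_confidence=70.0)
print(result["full_text"])       # Invoice Total
print(result["avg_confidence"])  # 91.0
```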
OCRPipeline(
    tesseract_cmd: Optional[str] = None,
    tessdata_prefix: Optional[str] = None,
    poppler_path: Optional[str] = None,
    dpi: int = 300,
    lang: str = 'eng'
)

Parameters:
- tesseract_cmd: Path to Tesseract executable
- tessdata_prefix: Path to tessdata directory
- poppler_path: Path to Poppler binaries
- dpi: DPI for PDF to image conversion (default: 300)
- lang: Language code for OCR (default: 'eng')
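To give a sense of what the dpi parameter implies for image size, the arithmetic can be sketched as follows. PDF page dimensions are expressed in points (1/72 inch), so a page rendered at a given DPI has roughly points × dpi / 72 pixels per side. `page_pixels` is a hypothetical helper, and the renderer's exact rounding may differ by a pixel.

```python
def page_pixels(width_pt, height_pt, dpi=300):
    """Approximate pixel size of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
print(page_pixels(612, 792))           # (2550, 3300)
print(page_pixels(612, 792, dpi=150))  # (1275, 1650)
```

Halving the DPI quarters the pixel count, which reduces memory use and processing time but may lower OCR accuracy on small text.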
process_image()
process_image(
    image_source: Union[str, Path, Image.Image],
    extract_data: bool = False,
    min_confidence: Optional[float] = None
) -> Union[str, Dict[str, Any]]

process_pdf()
process_pdf(
    pdf_path: Union[str, Path],
    extract_data: bool = False,
    page_number: Optional[int] = None,
    min_confidence: Optional[float] = None
) -> Union[str, List[str], Dict[str, Any], List[Dict[str, Any]]]

process()
process(
    input_path: Union[str, Path],
    extract_data: bool = False,
    page_number: Optional[int] = None,
    min_confidence: Optional[float] = None
) -> Union[str, List[str], Dict[str, Any], List[Dict[str, Any]]]

Simple Text:
text: str = "Extracted text..."

Detailed Data:
data: Dict[str, Any] = {
    'text': ['word1', 'word2', ...],
    'conf': [95, 87, ...],
    'left': [10, 50, ...],
    'top': [20, 25, ...],
    'width': [40, 35, ...],
    'height': [15, 15, ...],
}

Confidence Filtered:
result: Dict[str, Any] = {
    'full_text': 'Complete text...',
    'words': [
        {
            'text': 'word',
            'confidence': 95,
            'bbox': {'left': 10, 'top': 20, 'width': 40, 'height': 15}
        },
    ],
    'avg_confidence': 87.5,
    'total_words': 150
}

The project includes a comprehensive test suite with 104 tests:
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test files
pytest tests/parsers/test_ocr_parser_unit.py -v
pytest tests/parsers/test_ocr_pipeline_unit.py -v
pytest tests/services/test_pdf_service_unit.py -v
# Run integration tests (requires Tesseract installed)
pytest tests/parsers/test_ocr_parser.py -v
pytest tests/parsers/test_ocr_pipeline.py -v
pytest tests/services/test_pdf_service.py -v

- 52 Unit Tests - Fast, isolated tests with mocks
- 52 Integration Tests - End-to-end tests with real dependencies
- 100% Code Coverage - All major functions tested
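As an illustration of what a fast, isolated test with mocks can look like, the sketch below replaces the OCR engine with a `unittest.mock.MagicMock` so that no Tesseract install is needed. `TinyParser` is a hypothetical stand-in written for this example only; it is not the project's TesseractOCRParser, and the real unit tests may be structured differently.

```python
from unittest.mock import MagicMock


class TinyParser:
    """Hypothetical stand-in for an OCR parser; not the project's class."""

    def __init__(self, engine):
        # The engine is injected so a test can substitute a mock.
        self.engine = engine

    def extract_text(self, image):
        return self.engine.image_to_string(image).strip()


def test_extract_text_strips_whitespace():
    engine = MagicMock()
    engine.image_to_string.return_value = "  hello world\n"

    parser = TinyParser(engine)

    assert parser.extract_text("fake-image") == "hello world"
    engine.image_to_string.assert_called_once_with("fake-image")
```

Saved in a file whose name starts with test_, this would be collected by pytest like the other tests and runs without any external binaries.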
Method 1: In code
pipeline = OCRPipeline(
    tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe',
    tessdata_prefix=r'C:\Program Files\Tesseract-OCR\tessdata'
)

Method 2: Environment variables
export TESSERACT_CMD="/usr/bin/tesseract"
export TESSDATA_PREFIX="/usr/share/tesseract-ocr/tessdata"

pipeline = OCRPipeline(
    poppler_path=r'C:\poppler\Library\bin'  # Windows
)

# Use different language
pipeline = OCRPipeline(lang='fra')  # French
pipeline = OCRPipeline(lang='deu')  # German
pipeline = OCRPipeline(lang='spa')  # Spanish
# Multiple languages
pipeline = OCRPipeline(lang='eng+fra')  # English + French

from src.parsers.ocr_pipeline import OCRPipeline
from src.parsers.ocr_parser import OCRError
from src.services.pdf_service import PDFConversionError
try:
    pipeline = OCRPipeline()
    text = pipeline.process("document.pdf")
except FileNotFoundError as e:
    print(f"File not found: {e}")
except PDFConversionError as e:
    print(f"PDF conversion failed: {e}")
except OCRError as e:
    print(f"OCR processing failed: {e}")

python-tesseract/
├── .gitignore                  # Git ignore rules
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Pytest configuration
│
├── src/                        # Source code
│   ├── __init__.py
│   ├── parsers/
│   │   ├── __init__.py
│   │   ├── ocr_parser.py       # OCR text extraction
│   │   └── ocr_pipeline.py     # Main orchestrator
│   └── services/
│       ├── __init__.py
│       └── pdf_service.py      # PDF conversion
│
├── tests/                      # Test suite (104 tests)
│   ├── conftest.py
│   ├── parsers/
│   │   ├── test_ocr_parser.py
│   │   ├── test_ocr_parser_unit.py
│   │   ├── test_ocr_pipeline.py
│   │   └── test_ocr_pipeline_unit.py
│   └── services/
│       ├── test_pdf_service.py
│       └── test_pdf_service_unit.py
│
└── examples/                   # Usage examples
    ├── README.md
    ├── basic_usage.py
    ├── advanced_usage.py
    └── pdf_conversion.py
- Invoice Processing - Extract data from scanned invoices
- Document Digitization - Convert PDFs to searchable text
- Form Processing - Extract fields from forms and surveys
- Receipt Parsing - Read and process receipt data
- Business Card Scanning - Extract contact information
- Archive Digitization - Convert old documents to searchable text
- Python 3.10+
- pytesseract
- Pillow (PIL)
- pdf2image
- Tesseract OCR (system install)
- Poppler (for PDF support)
See requirements.txt for complete dependencies.
[Your License Here]
This is a production-ready template. Feel free to extend and customize for your needs.
For issues and questions, please open an issue on the repository.