file2text is a powerful, containerized REST API service for extracting text from various document formats, optimized for integration with LLMs, n8n workflows, and other automation tools.
- Multi-format Support: Extract text from PDFs, DOC/DOCX, and various image formats
- Smart OCR Processing: Automatically uses OCR for scanned documents and images
- Multiple Input Methods: Upload files directly or send base64-encoded data
- Batch Processing: Process multiple files in a single request
- LLM-Ready Output: Clean, structured JSON responses ready for LLM processing
- n8n Integration: Seamless integration with n8n workflows
- Docker Containerized: Easy deployment in any environment
- Secure API: Protected endpoints with API key authentication
- Multilingual Support: Built-in English and Hebrew text recognition
- FastAPI: High-performance Python web framework
- Tesseract OCR: Optical Character Recognition for images and scanned PDFs
- PyPDF2 & pdf2image: PDF processing and conversion
- python-docx: Microsoft Word document processing
- Docker & Docker Compose: Containerization and orchestration
- Docker & Docker Compose
- 2GB+ RAM recommended for optimal OCR performance
- Sufficient disk space for temporary file processing
file2text utilizes a combination of direct text extraction and OCR technologies to process your documents:
- PDFs: First attempts direct text extraction, falls back to OCR for scanned documents
- DOC/DOCX: Extracts text from Microsoft Word documents (including legacy .doc via LibreOffice)
- Images: Uses Tesseract OCR with image pre-processing for optimal text recognition
- All extracted text is returned in clean, structured JSON format
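The "direct extraction first, OCR as fallback" flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, the `MIN_TEXT_CHARS` threshold, and the `method` key are all hypothetical, and the two extractors are passed in as plain callables so the sketch runs without PyPDF2 or Tesseract installed.

```python
from typing import Callable

# Hypothetical threshold: below this many non-whitespace characters,
# the direct result is treated as "no usable text layer".
MIN_TEXT_CHARS = 20


def extract_with_fallback(
    data: bytes,
    direct_extract: Callable[[bytes], str],
    ocr_extract: Callable[[bytes], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> dict:
    """Try direct extraction first; use OCR when it yields too little text."""
    try:
        text = direct_extract(data)
    except Exception:
        # In this sketch a parser failure is handled like an empty result.
        text = ""

    # Count only non-whitespace characters when judging the direct result.
    if len("".join(text.split())) >= min_chars:
        return {"text": text, "method": "direct"}

    return {"text": ocr_extract(data), "method": "ocr"}
```

In a real setup, `direct_extract` would wrap a PDF text-layer reader and `ocr_extract` would render pages to images and run OCR on them; how file2text decides when to fall back is not documented here, so the character threshold is only one plausible heuristic.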
- Clone the repository:
    git clone https://github.com/Aviadg/file2text.git
    cd file2text
- Create a .env file with your API key:
    echo "API_KEY=your-secure-api-key-here" > .env
- Start the service:
    docker-compose up -d
- The API will be available at http://localhost:8000
- Clone the repository
- Create a .env file with your API key
- Customize the Dockerfile or docker-compose.yml if needed
- Build and start the service:
    docker-compose build
    docker-compose up -d
Once the service is running, access the interactive API documentation at http://localhost:8000/docs
All API endpoints require an API key for authentication. Include it in your requests using the X-API-Key header:
X-API-Key: your-secure-api-key-here
- GET / - Service status check
- POST /extract-text/ - Extract text from uploaded file
- POST /batch-extract/ - Process multiple uploaded files
- POST /extract-text-base64/ - Extract text from base64-encoded file
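As a rough sketch of how a Python client might attach the key, the snippet below prepares a request for the status endpoint using only the standard library. The function name `build_status_request` is hypothetical; the URL and header name come from the text above. The request is built but not sent, so the code runs without a live service.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the setup steps


def build_status_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated GET / request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/",
        headers={"X-API-Key": api_key, "accept": "application/json"},
        method="GET",
    )
```

With the service running, passing the result to `urllib.request.urlopen(req, timeout=10)` would send it; a missing or wrong key should be rejected by the API, though the exact status code is not specified in this README.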
curl -X POST \
http://localhost:8000/extract-text/ \
-H "X-API-Key: your-secure-api-key-here" \
-H "accept: application/json" \
-F "file=@/path/to/your/document.pdf"

{
  "filename": "document.pdf",
  "text": "This is the extracted text content from the PDF file...",
  "file_type": "pdf"
}

file2text is optimized for use with n8n workflows. Here's how to connect:
- In n8n, add an HTTP Request node
- Configure it to send a POST request to http://yourhostip:8000/extract-text-base64/
- Add the X-API-Key header with your API key
- Set the body to JSON with this structure:
  {
    "filename": "{{$node[\"Read Binary File\"].binary.data.fileName}}",
    "content_type": "{{$node[\"Read Binary File\"].binary.data.mimeType}}",
    "base64_data": "{{$binary.data.toString('base64')}}"
  }
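Outside n8n, the same JSON body can be assembled by hand. The sketch below builds it in Python with the three field names shown above (`filename`, `content_type`, `base64_data`); the helper name `build_base64_payload` and the `application/octet-stream` fallback are assumptions made for illustration.

```python
import base64
import json
import mimetypes
from typing import Optional


def build_base64_payload(
    filename: str,
    data: bytes,
    content_type: Optional[str] = None,
) -> str:
    """Return the JSON body for POST /extract-text-base64/ as a string."""
    if content_type is None:
        # Guess the MIME type from the file extension when none is given.
        guessed, _ = mimetypes.guess_type(filename)
        content_type = guessed or "application/octet-stream"

    return json.dumps(
        {
            "filename": filename,
            "content_type": content_type,
            # Standard base64, decoded to str so it is JSON-serializable.
            "base64_data": base64.b64encode(data).decode("ascii"),
        }
    )
```

The returned string would be sent as the request body with `Content-Type: application/json` and the `X-API-Key` header.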
If running both n8n and file2text in Docker containers, use one of these approaches:
- Use the host's IP address in the HTTP Request URL:
    http://YOUR_HOST_IP:8000/extract-text-base64/
- Create a shared Docker network:
    docker network create shared-network
  Then update both docker-compose files to use this network and reference the service by name:
    http://text-extraction-api:8000/extract-text-base64/
file2text/
│
├── docker-compose.yml # Main docker-compose configuration
├── Dockerfile # Docker image configuration
├── .env # Environment variables including API key
│
├── app/ # Main application directory
│ ├── requirements.txt # Python dependencies
│ ├── main.py # FastAPI main application file
│ │
│ └── extractors/ # Package for text extraction modules
│ ├── __init__.py # Package initialization
│ ├── pdf_extractor.py # PDF extraction module
│ ├── doc_extractor.py # DOC/DOCX extraction module
│ └── image_extractor.py # Image extraction module
│
└── uploads/ # Directory for temporary file uploads
- Set a strong, random API key in production
- The service processes files temporarily and automatically removes them after extraction
- Use HTTPS in production environments
- Consider implementing rate limiting for additional security
- Monitor API usage and implement appropriate access controls
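For the first point, one way to produce a strong, random key is Python's `secrets` module. The sketch below generates a key, formats the matching `.env` line, and shows a constant-time comparison. The helper names are hypothetical, and how file2text itself compares keys is not shown in this README, so `keys_match` is only a general recommendation.

```python
import hmac
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random key built from num_bytes of entropy."""
    return secrets.token_urlsafe(num_bytes)


def env_line(api_key: str) -> str:
    """Format the key as the API_KEY entry expected in the .env file."""
    return f"API_KEY={api_key}\n"


def keys_match(provided: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Writing `env_line(generate_api_key())` to `.env` replaces the placeholder value used in the setup steps.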
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Project Link: https://github.com/Aviadg/file2text
Made with ❤️ by Aviad