A document processing engine designed for RAG (Retrieval-Augmented Generation) pipelines.
DocWeaver analyzes and splits large CHM and PDF documents into smaller chunks based on token count limits, making them suitable for AI model processing.
For detailed usage instructions, use cases, and the problems this tool solves, please refer to my technical article: https://xiye.art/posts/2025-01-15-rag-knowledge-base/
- Multi-format Support: Process CHM and PDF documents
- Token-based Analysis: Analyze document token counts with split recommendations
- Intelligent Splitting: Split large documents into multiple PDFs based on configurable token limits
- Dual Interface: GUI for drag-and-drop convenience, CLI for automation and batch processing
- Cross-platform: Works on Windows and macOS
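To make the token-based splitting idea concrete, here is a minimal sketch of a greedy planner that groups consecutive pages so that each group stays under a token limit. It is an illustration only, not DocWeaver's actual implementation: the function name `plan_splits` and the per-page token counts are hypothetical, and this README does not specify how DocWeaver counts tokens or chooses split points.

```python
from typing import List


def plan_splits(page_tokens: List[int], max_tokens: int = 500_000) -> List[List[int]]:
    """Group consecutive page indices so each group's total stays <= max_tokens.

    Hypothetical helper for illustration. A single page that is larger than
    the limit cannot be divided here, so it ends up in a group of its own.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0

    for index, tokens in enumerate(page_tokens):
        # Close the current group when adding this page would exceed the limit.
        if current and current_total + tokens > max_tokens:
            groups.append(current)
            current, current_total = [], 0
        current.append(index)
        current_total += tokens

    # Flush the last, partially filled group.
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Made-up per-page token counts, with a small limit for readability.
    print(plan_splits([400, 300, 500, 200, 900], max_tokens=1000))
    # prints [[0, 1], [2, 3], [4]]
```

Each inner list would correspond to one output PDF. Keeping pages in their original order means every output file is a contiguous slice of the source document.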
Download the standalone executable from the Releases page:
- Windows: DocWeaver-Windows.zip
- macOS: DocWeaver-macOS.zip
Extract and run the executable directly - no installation required.
Simply double-click the executable to launch the graphical interface:
- Drag and drop CHM/PDF files into the window
- Set the maximum token limit (default: 500,000)
- Click "Process" to analyze or split files
- View results and access output files
- Python 3.11
- Poetry - Install following the official guide
- System Dependencies:
  Windows:
  - Install 7-Zip
  - Install GTK3 Runtime
  - Add 7-Zip to PATH: C:\Program Files\7-Zip\ (or your installation path)
  macOS:
  brew install p7zip cairo pango gdk-pixbuf
- Clone the repository:
  git clone https://github.com/zcyh147/DocWeaver.git
  cd DocWeaver
- Install dependencies with Poetry:
  poetry install
Windows:
# Launch GUI interface
.\run.bat --gui
macOS:
# Launch GUI interface
./run.sh --gui
# or use short form
./run.sh -g
Windows:
# Analyze a document (shows token count and split recommendations)
.\run.bat -i document.chm
# Split a PDF into multiple files
.\run.bat -i document.pdf -s
# Split with custom output directory and token limit
.\run.bat -i document.pdf -s -o ./output -m 500000
macOS:
# Analyze a document
./run.sh -i document.chm
# Split a PDF into multiple files
./run.sh -i document.pdf -s
# Split with custom output directory and token limit
./run.sh -i document.pdf -s -o ./output -m 500000
-g, --gui          Launch GUI mode (ignores other arguments)
-i, --input-file   Input file to process (CHM or PDF)
-o, --output-dir   Output directory for split PDFs (default: ./output)
-m, --max-tokens   Maximum tokens per PDF (default: 500000)
-s, --split        Enable file splitting
Run .\run.bat --help (Windows) or ./run.sh --help (macOS) for detailed usage information.
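For readers who want to see how the options above fit together, the following is a rough sketch of the same flags declared with Python's standard `argparse` module. It is not DocWeaver's source code: the names `build_parser` and `select_mode` are hypothetical, and the handling of a missing input file is an assumption, since the README does not describe that case.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (illustrative sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(description="Analyze or split CHM/PDF documents")
    parser.add_argument("-g", "--gui", action="store_true",
                        help="Launch GUI mode (ignores other arguments)")
    parser.add_argument("-i", "--input-file",
                        help="Input file to process (CHM or PDF)")
    parser.add_argument("-o", "--output-dir", default="./output",
                        help="Output directory for split PDFs")
    parser.add_argument("-m", "--max-tokens", type=int, default=500000,
                        help="Maximum tokens per PDF")
    parser.add_argument("-s", "--split", action="store_true",
                        help="Enable file splitting")
    return parser


def select_mode(argv: List[str]) -> str:
    """Return 'gui', 'split', or 'analyze' for the given argument list."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.gui:
        # Documented behavior: GUI mode ignores every other argument.
        return "gui"
    if args.input_file is None:
        # Assumption: without --gui, an input file is required.
        parser.error("an input file is required unless --gui is given")
    return "split" if args.split else "analyze"


if __name__ == "__main__":
    print(select_mode(["-i", "document.pdf", "-s", "-m", "500000"]))
    # prints split
```

Without `-s`, the sketch falls back to analysis only, which matches the examples above where `-i document.chm` reports token counts and split recommendations without writing any files.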
To run the test suite with coverage:
Windows:
ci\test.bat
macOS:
./ci/test.sh
This project is licensed under the MIT License - see the LICENSE file for details.