DocWeaver

A document processing engine designed for RAG (Retrieval-Augmented Generation) pipelines.

DocWeaver analyzes and splits large CHM and PDF documents into smaller chunks based on token count limits, making them suitable for AI model processing.

For detailed usage instructions, use cases, and the problems this tool solves, please refer to my technical article: https://xiye.art/posts/2025-01-15-rag-knowledge-base/

Features

  • Multi-format Support: Process CHM and PDF documents
  • Token-based Analysis: Analyze document token counts with split recommendations
  • Intelligent Splitting: Split large documents into multiple PDFs based on configurable token limits
  • Dual Interface: GUI for drag-and-drop convenience, CLI for automation and batch processing
  • Cross-platform: Works on Windows and macOS
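
To make the token-based splitting idea concrete, here is a minimal sketch of one way to group consecutive pages into chunks that stay under a token limit. It is not DocWeaver's actual implementation: the function name `group_pages_by_token_limit`, the greedy strategy, and the per-page token counts are all assumptions made for illustration.

```python
from typing import List


def group_pages_by_token_limit(page_tokens: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily pack consecutive pages into chunks within max_tokens.

    Hypothetical helper, not part of DocWeaver. Returns a list of chunks,
    each a list of zero-based page indices. A single page that exceeds
    max_tokens on its own is placed in a chunk by itself.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    chunks: List[List[int]] = []
    current: List[int] = []
    current_total = 0

    for index, tokens in enumerate(page_tokens):
        # Close the current chunk when adding this page would exceed the limit.
        if current and current_total + tokens > max_tokens:
            chunks.append(current)
            current = []
            current_total = 0
        current.append(index)
        current_total += tokens

    # Flush the last partially filled chunk.
    if current:
        chunks.append(current)
    return chunks


# Made-up token counts for a five-page document.
pages = [300, 250, 400, 100, 600]
print(group_pages_by_token_limit(pages, max_tokens=700))
# → [[0, 1], [2, 3], [4]]
```

The number of chunks returned corresponds to the kind of split recommendation an analysis pass could report before any files are written.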

Quick Start

Standalone

Download the standalone executable from the Releases page:

  • Windows: DocWeaver-Windows.zip
  • macOS: DocWeaver-macOS.zip

Extract the archive and run the executable directly; no installation is required.

Usage

GUI Mode (Standalone)

Simply double-click the executable to launch the graphical interface:

  1. Drag and drop CHM/PDF files into the window
  2. Set the maximum token limit (default: 500,000)
  3. Click "Process" to analyze or split files
  4. View results and access output files

Script Version

Prerequisites

  1. Python 3.11

  2. Poetry (install it by following the official guide)

  3. System Dependencies:

    Windows:

    • Install 7-Zip
    • Install GTK3 Runtime
    • Add 7-Zip to PATH: C:\Program Files\7-Zip\ (or your installation path)

    macOS:

    brew install p7zip cairo pango gdk-pixbuf

Setup

  1. Clone the repository:

    git clone https://github.com/zcyh147/DocWeaver.git
    cd DocWeaver
  2. Install dependencies with Poetry:

    poetry install

Running the Script

GUI Mode

Windows:

# Launch GUI interface
.\run.bat --gui

macOS:

# Launch GUI interface
./run.sh --gui
# or use short form
./run.sh -g

CLI Mode

Windows:

# Analyze a document (shows token count and split recommendations)
.\run.bat -i document.chm

# Split a PDF into multiple files
.\run.bat -i document.pdf -s

# Split with custom output directory and token limit
.\run.bat -i document.pdf -s -o ./output -m 500000

macOS:

# Analyze a document
./run.sh -i document.chm

# Split a PDF into multiple files
./run.sh -i document.pdf -s

# Split with custom output directory and token limit
./run.sh -i document.pdf -s -o ./output -m 500000

Command-line Options

-g, --gui              Launch GUI mode (ignores other arguments)
-i, --input-file       Input file to process (CHM or PDF)
-o, --output-dir       Output directory for split PDFs (default: ./output)
-m, --max-tokens       Maximum tokens per PDF (default: 500000)
-s, --split            Enable file splitting

Run .\run.bat --help (Windows) or ./run.sh --help (macOS) for detailed usage information.
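
To show how these options and their defaults combine, the sketch below mirrors the table above with Python's standard `argparse` module. It is an illustration, not DocWeaver's actual parser: the `build_parser` function and the `docweaver-sketch` program name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="docweaver-sketch",
        description="Illustrative mirror of the documented DocWeaver options.",
    )
    parser.add_argument("-g", "--gui", action="store_true", help="Launch GUI mode")
    parser.add_argument("-i", "--input-file", help="Input file to process (CHM or PDF)")
    parser.add_argument("-o", "--output-dir", default="./output",
                        help="Output directory for split PDFs")
    parser.add_argument("-m", "--max-tokens", type=int, default=500000,
                        help="Maximum tokens per PDF")
    parser.add_argument("-s", "--split", action="store_true", help="Enable file splitting")
    return parser


# Omitted options fall back to the documented defaults.
args = build_parser().parse_args(["-i", "document.pdf", "-s", "-m", "200000"])
print(args.input_file, args.split, args.output_dir, args.max_tokens)
# → document.pdf True ./output 200000
```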

Running Tests

To run the test suite with coverage:

Windows:

ci\test.bat

macOS:

./ci/test.sh

License

This project is licensed under the MIT License - see the LICENSE file for details.
