DocWeaver

A document processing engine designed for RAG (Retrieval-Augmented Generation) pipelines.

DocWeaver analyzes and splits large CHM and PDF documents into smaller chunks based on token count limits, making them suitable for AI model processing.
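
As a rough illustration of what token-limit splitting involves, the sketch below greedily groups consecutive pages so that each group stays within a limit. It assumes per-page token counts have already been computed. The function name `group_pages` and the greedy strategy are illustrative assumptions, not necessarily how DocWeaver itself splits documents.

```python
from typing import List


def group_pages(page_tokens: List[int], max_tokens: int) -> List[List[int]]:
    """Group consecutive page indices so each group's token total stays
    at or below max_tokens (hypothetical helper, not DocWeaver's API).

    A single page that alone exceeds max_tokens is placed in its own group.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0

    for index, tokens in enumerate(page_tokens):
        # Close the current group when adding this page would exceed the limit.
        if current and current_total + tokens > max_tokens:
            groups.append(current)
            current, current_total = [], 0
        current.append(index)
        current_total += tokens

    # Keep the final, partially filled group.
    if current:
        groups.append(current)
    return groups
```

For example, `group_pages([300, 300, 300, 300], 700)` puts pages 0 and 1 in the first group and pages 2 and 3 in the second.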

For detailed usage instructions, use cases, and the problems this tool solves, please refer to my technical article: https://xiye.art/posts/2025-01-15-rag-knowledge-base/

Features

Multi-format Support: Process CHM and PDF documents
Token-based Analysis: Analyze document token counts with split recommendations
Intelligent Splitting: Split large documents into multiple PDFs based on configurable token limits
Dual Interface: GUI for drag-and-drop convenience, CLI for automation and batch processing
Cross-platform: Works on Windows and macOS
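
To give a sense of what a split recommendation could look like, here is a minimal sketch that computes the smallest number of parts a document would need for a given token limit. The function name `recommend_parts` is hypothetical, and the result is only a lower bound: a real split must also respect page boundaries, so the tool may produce more parts than this.

```python
def recommend_parts(total_tokens: int, max_tokens: int = 500_000) -> int:
    """Return the minimum number of parts needed so that no part
    exceeds max_tokens (hypothetical helper, lower bound only)."""
    if total_tokens < 0:
        raise ValueError("total_tokens must not be negative")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Integer ceiling division; an empty document still counts as one part.
    return max(1, (total_tokens + max_tokens - 1) // max_tokens)
```

With the default limit of 500,000 tokens, a document of 1,200,000 tokens would need at least three parts.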

Quick Start

Standalone

Download the standalone executable from the Releases page:

Windows: DocWeaver-Windows.zip
macOS: DocWeaver-macOS.zip

Extract and run the executable directly - no installation required.

Usage

GUI Mode (Standalone)

Simply double-click the executable to launch the graphical interface:

Drag and drop CHM/PDF files into the window
Set the maximum token limit (default: 500,000)
Click "Process" to analyze or split files
View results and access output files

Script Version

Prerequisites

Python 3.11
Poetry (install it by following the official guide)
System Dependencies:

Windows:
- Install 7-Zip and add its installation directory to PATH (default: C:\Program Files\7-Zip\)
- Install GTK3 Runtime
macOS:
```
brew install p7zip cairo pango gdk-pixbuf
```

Setup

Clone the repository:

git clone https://github.com/zcyh147/DocWeaver.git
cd DocWeaver

Install dependencies with Poetry:
```
poetry install
```

Running the Script

GUI Mode

Windows:

# Launch GUI interface
.\run.bat --gui

macOS:

# Launch GUI interface
./run.sh --gui
# or use short form
./run.sh -g

CLI Mode

Windows:

# Analyze a document (shows token count and split recommendations)
.\run.bat -i document.chm

# Split a PDF into multiple files
.\run.bat -i document.pdf -s

# Split with custom output directory and token limit
.\run.bat -i document.pdf -s -o ./output -m 500000

macOS:

# Analyze a document
./run.sh -i document.chm

# Split a PDF into multiple files
./run.sh -i document.pdf -s

# Split with custom output directory and token limit
./run.sh -i document.pdf -s -o ./output -m 500000
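
For batch processing, the CLI can be driven from a small script. The sketch below builds one split command per PDF in a folder, using only the `-i`, `-s`, `-o`, and `-m` flags shown above. The helper names, the one-output-folder-per-document layout, and the choice to process only `.pdf` files are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_split_command(input_file: Path, output_dir: Path,
                        max_tokens: int = 500_000,
                        runner: str = "./run.sh") -> List[str]:
    """Build the argument list for one split run (hypothetical helper).

    Use runner=".\\run.bat" on Windows.
    """
    return [runner, "-i", str(input_file), "-s",
            "-o", str(output_dir), "-m", str(max_tokens)]


def build_batch_commands(input_dir: Path, output_root: Path,
                         max_tokens: int = 500_000,
                         runner: str = "./run.sh") -> List[List[str]]:
    """Build one split command per PDF found directly inside input_dir."""
    commands = []
    for path in sorted(input_dir.iterdir()):
        if path.suffix.lower() == ".pdf":
            # Assumption: each document gets its own output subfolder.
            commands.append(
                build_split_command(path, output_root / path.stem,
                                    max_tokens, runner))
    return commands


def run_batch(commands: List[List[str]]) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_batch(build_batch_commands(Path("docs"), Path("output")))` from the repository root would then split every PDF in a `docs` folder, assuming such a folder exists.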

Command-line Options

-g, --gui              Launch GUI mode (ignores other arguments)
-i, --input-file       Input file to process (CHM or PDF)
-o, --output-dir       Output directory for split PDFs (default: ./output)
-m, --max-tokens       Maximum tokens per PDF (default: 500000)
-s, --split            Enable file splitting

Run .\run.bat --help (Windows) or ./run.sh --help (macOS) for detailed usage information.
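
The option table above maps naturally onto Python's `argparse`. The following is a minimal sketch of a parser with the same flags and defaults. It is an illustration only and not the project's actual source; the `prog` name `docweaver` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="docweaver",  # assumed program name
        description="Analyze or split CHM and PDF documents by token count.",
    )
    parser.add_argument("-g", "--gui", action="store_true",
                        help="Launch GUI mode (ignores other arguments)")
    parser.add_argument("-i", "--input-file",
                        help="Input file to process (CHM or PDF)")
    parser.add_argument("-o", "--output-dir", default="./output",
                        help="Output directory for split PDFs")
    parser.add_argument("-m", "--max-tokens", type=int, default=500000,
                        help="Maximum tokens per PDF")
    parser.add_argument("-s", "--split", action="store_true",
                        help="Enable file splitting")
    return parser
```

Parsing `["-i", "document.pdf", "-s"]` with this parser yields `input_file="document.pdf"`, `split=True`, and the default values for the remaining options.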

Running Tests

To run the test suite with coverage:

Windows:

ci\test.bat

macOS:

./ci/test.sh

License

This project is licensed under the MIT License - see the LICENSE file for details.
