Local PDF Knowledge Extraction & RAG Pipeline
Transform Salesforce documentation into searchable, AI-ready knowledge
Features • Quick Start • Usage • Architecture • Configuration
- No API calls, no cloud dependencies. Your documents stay on your machine. Process sensitive internal documentation with confidence.
- PyMuPDF extracts 90% of pages in milliseconds. sqlite-vec provides sub-50ms vector search at scale.
- Combines semantic understanding (vector similarity) with keyword matching (FTS5) using Reciprocal Rank Fusion.
- Generate clean markdown with citations, organized by skill. Perfect for PRs to your documentation repos.
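To make the Reciprocal Rank Fusion step mentioned above concrete, here is a minimal sketch of the standard formula, where each result scores the sum of 1 / (k + rank) across the lists it appears in. The function name, the chunk IDs, and the constant k=60 are assumptions for illustration; the project's own implementation may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c12", "c07", "c33"]   # hypothetical IDs from vector search
keyword_hits = ["c07", "c41", "c12"]  # hypothetical IDs from FTS5 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c07', 'c12', 'c41', 'c33']
```

Because only ranks are used, the vector distances and FTS5 scores never need to be put on a common scale, which is the main appeal of this fusion method.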
# macOS (Homebrew)
brew install tesseract uv
# Or install uv via pip
pip install uv
# Clone the repository
git clone https://github.com/Jaganpro/sf-knowledge-tools.git
cd sf-knowledge-tools
# Install dependencies (creates .venv automatically)
uv sync

First run: the embedding model (~1.3GB) downloads automatically on first use.
sf-knowledge ingest ~/Documents/salesforce-apex-guide.pdf --category apex

Example output:
Ingesting: salesforce-apex-guide.pdf
Extracting PDF...                     ━━━━━━━━━━━━━━━━━━━━ 100%
Chunking content...                   ━━━━━━━━━━━━━━━━━━━━ 100%
Generating embeddings (450 chunks)... ━━━━━━━━━━━━━━━━━━━━ 100%
Storing chunks...                     ━━━━━━━━━━━━━━━━━━━━ 100%
Successfully ingested: Salesforce Apex Developer Guide
Document ID  a1b2c3d4
Pages        234
Chunks       450
Category     apex
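If you have a folder of PDFs, the ingest command above can be scripted. The sketch below only relies on the `sf-knowledge ingest <file> --category <name>` form shown in this README; the helper names are hypothetical, and it assumes the CLI is on your PATH and takes one file per call.

```python
import subprocess
from pathlib import Path


def build_ingest_commands(pdf_dir, category):
    """Return one `sf-knowledge ingest` command (as an argv list) per PDF."""
    pdfs = sorted(Path(pdf_dir).expanduser().glob("*.pdf"))
    return [
        ["sf-knowledge", "ingest", str(pdf), "--category", category]
        for pdf in pdfs
    ]


def ingest_all(pdf_dir, category):
    """Run the commands one by one, stopping at the first failure."""
    for cmd in build_ingest_commands(pdf_dir, category):
        subprocess.run(cmd, check=True)
```

Calling `ingest_all("pdfs", "apex")` would then ingest every PDF found directly inside `pdfs/` under the `apex` category.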
sf-knowledge query "How do I handle governor limits in batch Apex?"

Example output:
Searching: How do I handle governor limits in batch Apex?
Found 5 results in 45.2ms
╭──────────────────────── Result 1 (score: 0.892) ─────────────────────────╮
│ Governor limits are enforced at runtime. In batch Apex, each execute     │
│ method invocation gets a fresh set of limits. To avoid hitting limits:   │
│                                                                          │
│ 1. Use Database.Stateful to maintain state across batches                │
│ 2. Keep batch size manageable (default 200, reduce if needed)            │
│ 3. Use Database.executeBatch() with scope parameter                      │
╰────────────────────── Chapter: Batch Apex | p. 145 ──────────────────────╯
sf-knowledge export "Apex Governor Limits" --skill sf-apex
Exported to: exports/sf-apex/apex-governor-limits.md
sf-knowledge status

Knowledge Base Status
Database
Location data/knowledge.db
Size 24.5 MB
Content
Documents 3
Chunks 2,450
Embeddings 2,450
┌─────────────────────────────────────────────────────────────────────────┐
│                           sf-knowledge-tools                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│      PDF Input                                                          │
│           │                                                             │
│           ▼                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐      │
│  │  PDF Extractor  │───▶│ Semantic Chunker│───▶│ Embedding Client│      │
│  │  PyMuPDF + OCR  │    │  ~1000 tokens   │    │ BGE-large-v1.5  │      │
│  └─────────────────┘    └─────────────────┘    └────────┬────────┘      │
│                                                         │               │
│                                                         ▼               │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                        SQLite + sqlite-vec                        │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐    │  │
│  │  │  Documents  │  │   Chunks    │  │    Vector Embeddings    │    │  │
│  │  │   (meta)    │  │   (text)    │  │ (1024-dim, normalized)  │    │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│           │                                                             │
│           ▼                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐      │
│  │   RAG Engine    │───▶│    Exporter     │───▶│    Markdown     │      │
│  │  Hybrid Search  │    │ Jinja2 + Cites  │    │   (by skill)    │      │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
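The diagram notes that embeddings are stored normalized. One consequence is that cosine similarity reduces to a plain dot product, which the brute-force sketch below illustrates on toy 3-dimensional vectors. In the real pipeline sqlite-vec performs the search inside the database; the function names and sample vectors here are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, vectors, k=5):
    """Brute-force search over {chunk_id: unit-length vector}.

    For unit vectors the dot product equals the cosine similarity.
    Returns (chunk_id, score) pairs, best match first.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), chunk_id)
        for chunk_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy stand-ins for the 1024-dim embeddings.
store = {
    "a": normalize([1.0, 0.0, 0.0]),
    "b": normalize([1.0, 1.0, 0.0]),
    "c": normalize([0.0, 0.0, 1.0]),
}
results = top_k([2.0, 0.1, 0.0], store, k=2)
print([chunk_id for chunk_id, _ in results])  # ['a', 'b']
```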
| Component | Technology | Purpose |
|---|---|---|
| PDF Extraction | PyMuPDF + pdfplumber | Fast text extraction, table handling, OCR fallback |
| Chunking | Rule-based | ~1000 tokens, respects headers/code blocks |
| Embeddings | BAAI/bge-large-en-v1.5 | 1024-dim vectors, strong MTEB retrieval scores |
| Storage | SQLite + sqlite-vec | Single-file DB with vector similarity search |
| Search | Hybrid (Vector + FTS5) | Reciprocal Rank Fusion for best results |
| Export | Jinja2 | Templated markdown with citations |
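As a rough illustration of the rule-based chunking row in the table, the sketch below packs paragraphs into chunks near a target size and carries a small overlap into the next chunk. It is a simplification under stated assumptions: tokens are approximated by whitespace-separated words, header and code-block handling is omitted, and a paragraph longer than the target is kept whole. The function name is hypothetical.

```python
def chunk_paragraphs(text, target_size=1000, overlap=100):
    """Pack paragraphs into chunks of roughly target_size tokens.

    Tokens are approximated by whitespace-separated words. When a chunk
    is closed, its last `overlap` words are repeated at the start of the
    next chunk so context is not lost at the boundary.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > target_size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "\n\n".join(
    " ".join(f"{letter}{i}" for i in range(1, 7)) for letter in "abc"
)
chunks = chunk_paragraphs(sample, target_size=10, overlap=2)
print(len(chunks))  # 3
```

With the small numbers above, the second chunk starts with `a5 a6`, the last two words of the first paragraph, which shows the overlap at work.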
sf-knowledge-tools/
├── knowledge/           # Core Python library
│   ├── ingester/        # PDF extraction
│   ├── chunker/         # Semantic chunking
│   ├── embedder/        # Embedding generation
│   ├── storage/         # Vector store & schema
│   ├── query/           # RAG engine
│   ├── export/          # Markdown generation
│   └── cli.py           # Command-line interface
├── config/
│   └── knowledge.yml    # Configuration
├── pdfs/                # Your source PDFs (gitignored)
├── data/                # SQLite database (gitignored)
├── exports/             # Generated markdown (gitignored)
├── pyproject.toml       # Dependencies (uv)
└── LICENSE              # MIT License
Edit config/knowledge.yml to customize behavior:
embeddings:
  model: BAAI/bge-large-en-v1.5  # HuggingFace model
  dimensions: 1024
  batch_size: 32
chunking:
  target_size: 1000  # Target tokens per chunk
  max_size: 1500     # Maximum tokens
  overlap: 100       # Overlap between chunks
search:
  default_k: 5        # Results to return
  hybrid_weight: 0.7  # Vector vs keyword balance

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  1. INGEST   │────▶│   2. QUERY   │────▶│  3. EXPORT   │────▶│    4. PR     │
│              │     │              │     │              │     │              │
│ Add PDFs to  │     │ Search your  │     │   Generate   │     │ Copy to your │
│ knowledge DB │     │  knowledge   │     │   markdown   │     │  docs repo   │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for the Salesforce developer community