📚 sf-knowledge-tools

Local PDF Knowledge Extraction & RAG Pipeline
Transform Salesforce documentation into searchable, AI-ready knowledge

Features • Quick Start • Usage • Architecture • Configuration

✨ Features

🔒 100% Offline

No API calls, no cloud dependencies. After the one-time embedding model download, everything runs locally and your documents stay on your machine. Process sensitive internal documentation with confidence.

⚡ Fast & Efficient

PyMuPDF extracts text from about 90% of pages in milliseconds, with OCR as the fallback for the rest. sqlite-vec provides sub-50ms vector search at scale.

🎯 Hybrid Search

Combines semantic understanding (vector similarity) with keyword matching (FTS5) using Reciprocal Rank Fusion.

📝 Export Ready

Generate clean markdown with citations, organized by skill. Perfect for PRs to your documentation repos.

🚀 Quick Start

Prerequisites

# macOS (Homebrew)
brew install tesseract uv

# Or install uv via pip
pip install uv

Installation

# Clone the repository
git clone https://github.com/Jaganpro/sf-knowledge-tools.git
cd sf-knowledge-tools

# Install dependencies (creates .venv automatically)
uv sync

💡 First Run: The embedding model (~1.3GB) downloads automatically on first use.

📖 Usage

Ingest a PDF

sf-knowledge ingest ~/Documents/salesforce-apex-guide.pdf --category apex

📋 Example Output

📄 Ingesting: salesforce-apex-guide.pdf
  Extracting PDF...                      ━━━━━━━━━━━━━━━━━━━━ 100%
  Chunking content...                    ━━━━━━━━━━━━━━━━━━━━ 100%
  Generating embeddings (450 chunks)...  ━━━━━━━━━━━━━━━━━━━━ 100%
  Storing chunks...                      ━━━━━━━━━━━━━━━━━━━━ 100%

✅ Successfully ingested: Salesforce Apex Developer Guide
   Document ID  a1b2c3d4
   Pages        234
   Chunks       450
   Category     apex

Search the Knowledge Base

sf-knowledge query "How do I handle governor limits in batch Apex?"

📋 Example Output

🔍 Searching: How do I handle governor limits in batch Apex?

Found 5 results in 45.2ms

╭─────────────────────── 📌 Result 1 (score: 0.892) ───────────────────────╮
│ Governor limits are enforced at runtime. In batch Apex, each execute    │
│ method invocation gets a fresh set of limits. To avoid hitting limits:  │
│                                                                          │
│ 1. Use Database.Stateful to maintain state across batches               │
│ 2. Keep batch size manageable (default 200, reduce if needed)           │
│ 3. Use Database.executeBatch() with scope parameter                     │
╰────────────────── Chapter: Batch Apex | p. 145 ──────────────────────────╯

Export to Markdown

sf-knowledge export "Apex Governor Limits" --skill sf-apex

✅ Exported to: exports/sf-apex/apex-governor-limits.md

Check Status

sf-knowledge status

📊 Knowledge Base Status

     Database
 Location  data/knowledge.db
 Size      24.5 MB

     Content
 Documents   3
 Chunks      2,450
 Embeddings  2,450

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         sf-knowledge-tools                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   📄 PDF Input                                                           │
│       │                                                                  │
│       ▼                                                                  │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│   │   PDF Extractor │───▶│ Semantic Chunker│───▶│ Embedding Client│     │
│   │  PyMuPDF + OCR  │    │  ~1000 tokens   │    │  BGE-large-v1.5 │     │
│   └─────────────────┘    └─────────────────┘    └────────┬────────┘     │
│                                                          │               │
│                                                          ▼               │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     SQLite + sqlite-vec                          │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │   │
│   │  │  Documents  │  │   Chunks    │  │  Vector Embeddings      │  │   │
│   │  │   (meta)    │  │   (text)    │  │  (1024-dim, normalized) │  │   │
│   │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                    │                                     │
│                                    ▼                                     │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│   │   RAG Engine    │───▶│    Exporter     │───▶│  📝 Markdown    │     │
│   │  Hybrid Search  │    │  Jinja2 + Cites │    │   (by skill)    │     │
│   └─────────────────┘    └─────────────────┘    └─────────────────┘     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
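The diagram notes that embeddings are stored normalized. One common reason for doing so, sketched below with toy 3-dimensional vectors standing in for the 1024-dimensional ones, is that the dot product of unit-length vectors equals their cosine similarity. The helper names are hypothetical and not taken from the project's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors; real embeddings would come from the embedding model.
query = [3.0, 4.0, 0.0]
chunk = [4.0, 3.0, 0.0]

q_unit, c_unit = l2_normalize(query), l2_normalize(chunk)
similarity = dot(q_unit, c_unit)  # same value as cosine(query, chunk)
```

Normalizing once at ingest time means each query only needs a dot product per candidate, which is presumably why the vectors are stored this way.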

Components

| Component | Technology | Purpose |
|-----------|------------|---------|
| PDF Extraction | PyMuPDF + pdfplumber | Fast text extraction, table handling, OCR fallback |
| Chunking | Rule-based | ~1000 tokens, respects headers/code blocks |
| Embeddings | BAAI/bge-large-en-v1.5 | 1024-dim vectors, top MTEB retrieval model |
| Storage | SQLite + sqlite-vec | Single-file DB with vector similarity search |
| Search | Hybrid (Vector + FTS5) | Reciprocal Rank Fusion for best results |
| Export | Jinja2 | Templated markdown with citations |
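For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique, not the project's actual implementation. Each result list contributes 1 / (k + rank) to a chunk's score. The function name, the chunk IDs, and the constant k = 60 (the value commonly used in the RRF literature) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each ID earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result orders from the two retrievers.
vector_hits = ["c7", "c2", "c9"]   # semantic (vector similarity) order
keyword_hits = ["c2", "c4", "c7"]  # keyword (FTS5) order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both lists (here c2 and c7) rise above chunks found by only one retriever, which is the behavior hybrid search relies on. Because RRF uses ranks rather than raw scores, the vector distances and FTS5 scores never need to be put on a common scale.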

📁 Project Structure

sf-knowledge-tools/
├── 📂 knowledge/              # Core Python library
│   ├── ingester/              # PDF extraction
│   ├── chunker/               # Semantic chunking
│   ├── embedder/              # Embedding generation
│   ├── storage/               # Vector store & schema
│   ├── query/                 # RAG engine
│   ├── export/                # Markdown generation
│   └── cli.py                 # Command-line interface
├── 📂 config/
│   └── knowledge.yml          # Configuration
├── 📂 pdfs/                   # Your source PDFs (gitignored)
├── 📂 data/                   # SQLite database (gitignored)
├── 📂 exports/                # Generated markdown (gitignored)
├── 📄 pyproject.toml          # Dependencies (uv)
└── 📄 LICENSE                 # MIT License

⚙️ Configuration

Edit config/knowledge.yml to customize behavior:

embeddings:
  model: BAAI/bge-large-en-v1.5    # HuggingFace model
  dimensions: 1024
  batch_size: 32

chunking:
  target_size: 1000                 # Target tokens per chunk
  max_size: 1500                    # Maximum tokens
  overlap: 100                      # Overlap between chunks

search:
  default_k: 5                      # Results to return
  hybrid_weight: 0.7                # Vector vs keyword balance
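To show how target_size and overlap interact, here is a simplified sliding-window sketch. It is an assumption-laden illustration: the real chunker is rule-based and respects headers and code blocks, and max_size is not modeled here. The function name and the placeholder tokens are hypothetical.

```python
def chunk_tokens(tokens, target_size=1000, overlap=100):
    """Split a token list into windows of target_size that share `overlap` tokens."""
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be non-negative and smaller than target_size")
    step = target_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target_size])
        if start + target_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for real tokenizer output.
tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
chunk_lengths = [len(c) for c in chunks]
```

With the defaults, each window advances by 900 tokens, so the last 100 tokens of one chunk reappear at the start of the next. A larger overlap keeps more context across chunk boundaries at the cost of more chunks and embeddings to store.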

🔄 Workflow

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   1. INGEST  │────▶│   2. QUERY   │────▶│  3. EXPORT   │────▶│    4. PR     │
│              │     │              │     │              │     │              │
│ Add PDFs to  │     │ Search your  │     │ Generate     │     │ Copy to your │
│ knowledge DB │     │ knowledge    │     │ markdown     │     │ docs repo    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for the Salesforce developer community
