Jaganpro/sf-knowledge-tools

Salesforce Python SQLite MIT License

📚 sf-knowledge-tools

Local PDF Knowledge Extraction & RAG Pipeline
Transform Salesforce documentation into searchable, AI-ready knowledge

Features • Quick Start • Usage • Architecture • Configuration


✨ Features

🔒 100% Offline

No API calls, no cloud dependencies. After the one-time embedding model download on first run, your documents stay on your machine. Process sensitive internal documentation with confidence.

⚡ Fast & Efficient

PyMuPDF extracts text from about 90% of pages in milliseconds. sqlite-vec provides sub-50ms vector search at scale.

🎯 Hybrid Search

Combines semantic understanding (vector similarity) with keyword matching (FTS5) using Reciprocal Rank Fusion.
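As a rough illustration of Reciprocal Rank Fusion, the sketch below merges a vector ranking and a keyword ranking into one list. The function name, the chunk IDs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for this example; the README does not state which constant the tool uses or how `hybrid_weight` enters the fusion.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) to a chunk's score, where rank
    starts at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behavior hybrid search relies on.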

📝 Export Ready

Generate clean markdown with citations, organized by skill. Perfect for PRs to your documentation repos.


🚀 Quick Start

Prerequisites

# macOS (Homebrew)
brew install tesseract uv

# Or install uv via pip
pip install uv

Installation

# Clone the repository
git clone https://github.com/Jaganpro/sf-knowledge-tools.git
cd sf-knowledge-tools

# Install dependencies (creates .venv automatically)
uv sync

💡 First Run: The embedding model (~1.3GB) downloads automatically on first use.


📖 Usage

Ingest a PDF

sf-knowledge ingest ~/Documents/salesforce-apex-guide.pdf --category apex
📋 Example Output
📄 Ingesting: salesforce-apex-guide.pdf
  Extracting PDF...                      ━━━━━━━━━━━━━━━━━━━━ 100%
  Chunking content...                    ━━━━━━━━━━━━━━━━━━━━ 100%
  Generating embeddings (450 chunks)...  ━━━━━━━━━━━━━━━━━━━━ 100%
  Storing chunks...                      ━━━━━━━━━━━━━━━━━━━━ 100%

✅ Successfully ingested: Salesforce Apex Developer Guide
   Document ID  a1b2c3d4
   Pages        234
   Chunks       450
   Category     apex

Search the Knowledge Base

sf-knowledge query "How do I handle governor limits in batch Apex?"
📋 Example Output
🔍 Searching: How do I handle governor limits in batch Apex?

Found 5 results in 45.2ms

╭─────────────────────── 📌 Result 1 (score: 0.892) ───────────────────────╮
│ Governor limits are enforced at runtime. In batch Apex, each execute    │
│ method invocation gets a fresh set of limits. To avoid hitting limits:  │
│                                                                          │
│ 1. Use Database.Stateful to maintain state across batches               │
│ 2. Keep batch size manageable (default 200, reduce if needed)           │
│ 3. Use Database.executeBatch() with scope parameter                     │
╰────────────────── Chapter: Batch Apex | p. 145 ──────────────────────────╯

Export to Markdown

sf-knowledge export "Apex Governor Limits" --skill sf-apex
✅ Exported to: exports/sf-apex/apex-governor-limits.md

Check Status

sf-knowledge status
📊 Knowledge Base Status

     Database
 Location  data/knowledge.db
 Size      24.5 MB

     Content
 Documents   3
 Chunks      2,450
 Embeddings  2,450

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                         sf-knowledge-tools                               │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   📄 PDF Input                                                           │
│       │                                                                  │
│       ▼                                                                  │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│   │   PDF Extractor │───▶│ Semantic Chunker│───▶│ Embedding Client│     │
│   │  PyMuPDF + OCR  │    │  ~1000 tokens   │    │  BGE-large-v1.5 │     │
│   └─────────────────┘    └─────────────────┘    └────────┬────────┘     │
│                                                          │               │
│                                                          ▼               │
│   ┌──────────────────────────────────────────────────────────────────┐   │
│   │                     SQLite + sqlite-vec                          │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │   │
│   │  │  Documents  │  │   Chunks    │  │  Vector Embeddings      │  │   │
│   │  │   (meta)    │  │   (text)    │  │  (1024-dim, normalized) │  │   │
│   │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │   │
│   └──────────────────────────────────────────────────────────────────┘   │
│                                    │                                     │
│                                    ▼                                     │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│   │   RAG Engine    │───▶│    Exporter     │───▶│  📝 Markdown    │     │
│   │  Hybrid Search  │    │  Jinja2 + Cites │    │   (by skill)    │     │
│   └─────────────────┘    └─────────────────┘    └─────────────────┘     │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
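The diagram notes that embeddings are stored normalized. One consequence is that cosine similarity reduces to a plain dot product, which the brute-force sketch below illustrates with tiny 3-dimensional vectors standing in for the 1024-dimensional ones. The function names and sample data are hypothetical, and this is not how sqlite-vec performs the search internally.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, stored, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `stored` maps chunk IDs to vectors that are already unit length.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), chunk_id)
        for chunk_id, vec in stored.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones would have 1024 dimensions.
stored = {
    "batch-apex": normalize([0.9, 0.1, 0.0]),
    "triggers": normalize([0.1, 0.9, 0.0]),
    "soql": normalize([0.0, 0.2, 0.9]),
}

results = top_k([1.0, 0.2, 0.0], stored, k=2)
print(results)
```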

Components

| Component | Technology | Purpose |
| --- | --- | --- |
| PDF Extraction | PyMuPDF + pdfplumber | Fast text extraction, table handling, OCR fallback |
| Chunking | Rule-based | ~1000 tokens per chunk, respects headers and code blocks |
| Embeddings | BAAI/bge-large-en-v1.5 | 1024-dim vectors, high-ranking MTEB retrieval model |
| Storage | SQLite + sqlite-vec | Single-file DB with vector similarity search |
| Search | Hybrid (Vector + FTS5) | Results merged with Reciprocal Rank Fusion |
| Export | Jinja2 | Templated markdown with citations |
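For the keyword half of hybrid search, a minimal FTS5 sketch using Python's built-in `sqlite3` module might look like the following. The table name `chunks_fts`, its single `text` column, and the sample rows are assumptions for illustration, not the project's actual schema, and the example requires an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical full-text index over chunk text.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(text)")
conn.executemany(
    "INSERT INTO chunks_fts(rowid, text) VALUES (?, ?)",
    [
        (1, "Governor limits are enforced at runtime in batch Apex"),
        (2, "Triggers fire before or after DML operations"),
        (3, "SOQL queries count against governor limits"),
    ],
)


def keyword_search(conn, query, k=5):
    """Return up to k rowids matching the query, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts "
        "WHERE chunks_fts MATCH ? "
        "ORDER BY bm25(chunks_fts) "  # lower bm25() values are better matches
        "LIMIT ?",
        (query, k),
    ).fetchall()
    return [rowid for (rowid,) in rows]


hits = keyword_search(conn, "governor limits")
print(hits)
```

The two terms in the query are combined with an implicit AND, so only chunks containing both words are returned.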

📁 Project Structure

sf-knowledge-tools/
├── 📂 knowledge/              # Core Python library
│   ├── ingester/              # PDF extraction
│   ├── chunker/               # Semantic chunking
│   ├── embedder/              # Embedding generation
│   ├── storage/               # Vector store & schema
│   ├── query/                 # RAG engine
│   ├── export/                # Markdown generation
│   └── cli.py                 # Command-line interface
├── 📂 config/
│   └── knowledge.yml          # Configuration
├── 📂 pdfs/                   # Your source PDFs (gitignored)
├── 📂 data/                   # SQLite database (gitignored)
├── 📂 exports/                # Generated markdown (gitignored)
├── 📄 pyproject.toml          # Dependencies (uv)
└── 📄 LICENSE                 # MIT License

⚙️ Configuration

Edit config/knowledge.yml to customize behavior:

embeddings:
  model: BAAI/bge-large-en-v1.5    # HuggingFace model
  dimensions: 1024
  batch_size: 32

chunking:
  target_size: 1000                 # Target tokens per chunk
  max_size: 1500                    # Maximum tokens
  overlap: 100                      # Overlap between chunks

search:
  default_k: 5                      # Results to return
  hybrid_weight: 0.7                # Vector vs keyword balance
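To show how `target_size` and `overlap` interact, here is a simplified sliding-window sketch. It treats a list of placeholder strings as tokens and ignores `max_size` as well as the header and code-block boundaries that the real rule-based chunker respects. The function name and token model are assumptions, not the project's implementation.

```python
def chunk_tokens(tokens, target_size=1000, overlap=100):
    """Split a token list into windows of target_size sharing `overlap` tokens."""
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be non-negative and smaller than target_size")
    chunks = []
    step = target_size - overlap  # how far each new window advances
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target_size])
        if start + target_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

With the defaults above, each chunk repeats the last 100 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.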

🔄 Workflow

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   1. INGEST  │────▶│   2. QUERY   │────▶│  3. EXPORT   │────▶│    4. PR     │
│              │     │              │     │              │     │              │
│ Add PDFs to  │     │ Search your  │     │ Generate     │     │ Copy to your │
│ knowledge DB │     │ knowledge    │     │ markdown     │     │ docs repo    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
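The first three workflow steps could be scripted roughly as follows. The helper `build_workflow` is hypothetical; it only assembles the `ingest`, `query`, and `export` commands with the flags shown in the Usage section and prints them, leaving execution and the final PR step to you.

```python
import shlex


def build_workflow(pdf_path, category, question, topic, skill):
    """Return the ingest, query, and export commands as argv lists."""
    return [
        ["sf-knowledge", "ingest", pdf_path, "--category", category],
        ["sf-knowledge", "query", question],
        ["sf-knowledge", "export", topic, "--skill", skill],
    ]


commands = build_workflow(
    "salesforce-apex-guide.pdf",
    "apex",
    "How do I handle governor limits in batch Apex?",
    "Apex Governor Limits",
    "sf-apex",
)

for argv in commands:
    # To execute instead of print: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```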

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ for the Salesforce developer community
