A minimal Retrieval-Augmented Generation (RAG) pipeline using OpenAI and ChromaDB to answer questions from your own PDF documents.
- Initialize OpenAI embeddings and chat models
- Ingest PDF documents and extract text
- Split text into overlapping chunks
- Store embeddings in a persistent ChromaDB collection
- Load ChromaDB and retrieve relevant chunks
- Generate responses to user questions using OpenAI Chat API
- Support for a .env file, with a .gitignore entry to keep secrets out of version control
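The chunk-splitting step above can be sketched in plain Python. This is an illustrative implementation of overlapping character chunks, not necessarily the exact splitter used in `src.main`; the parameter names mirror the CHUNK_SIZE and CHUNK_OVERLAP settings below.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character chunks (a minimal sketch)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # each new chunk starts this many chars later
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk.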
- Create and activate a virtual environment (optional but recommended).
- Install dependencies: pip install -r requirements.txt
- Create a .env file in the project root with your OpenAI API key: OPENAI_API_KEY=sk-...
Optional environment variables:
- OPENAI_EMBEDDING_MODEL (default: text-embedding-3-small)
- OPENAI_CHAT_MODEL (default: gpt-4o-mini)
- CHROMA_DIR (default: ./chroma)
- CHROMA_COLLECTION (default: rag_docs)
- CHUNK_SIZE (default: 1000)
- CHUNK_OVERLAP (default: 200)
- TOP_K (default: 3)
Ingest PDFs from a folder: python -m src.main ingest path\to\pdf_folder
Ask a question: python -m src.main ask "What are the key points?"
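The two commands above imply a small subcommand CLI. A minimal argparse sketch of that surface looks like this (the actual parser in `src.main` may differ in details):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the ingest/ask CLI described above."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest", help="Ingest PDFs from a folder")
    ingest.add_argument("folder", help="Path to a folder of PDF files")
    ask = sub.add_parser("ask", help="Ask a question against the stored index")
    ask.add_argument("question", help="Question to answer from the documents")
    return parser
```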
- The Chroma vector store is persisted under CHROMA_DIR; consider adding that directory to .gitignore before committing the repository publicly.
- Ensure your .env is listed in .gitignore (it is included by default in this repo).
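Conceptually, the retrieval step ranks stored chunks by vector similarity to the question embedding and keeps the TOP_K best. ChromaDB performs this search internally; the pure-Python sketch below only illustrates the idea using cosine similarity (Chroma's actual distance metric depends on the collection's configuration).

```python
import math

def top_k_indices(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k vectors most similar to the query
    (illustrative only; ChromaDB does this search for you)."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The text of the winning chunks is then pasted into the chat prompt as context before the question is sent to the OpenAI Chat API.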