A comprehensive Streamlit application for processing PDFs from Google Drive using LlamaParse, creating embeddings with LlamaIndex, and storing vectors in Pinecone.
- Password Protection: Simple password-based access control
- LlamaParse Integration: High-quality PDF parsing with configurable parameters
- Flexible Chunking: Multiple strategies (Token-based, Sentence-based, Semantic)
- Multiple Embedding Models: Support for OpenAI and HuggingFace models
- Metadata Enrichment: Automatic metadata mapping from Google Sheets
- Pinecone Storage: Organized storage with projects, indices, and namespaces
- Batch Processing: Process multiple PDFs from a single Google Sheets configuration
You'll need API keys for the following services:
- LlamaParse: Get from cloud.llamaindex.ai
- Pinecone: Get from app.pinecone.io
- OpenAI (optional): For OpenAI embeddings from platform.openai.com
- Google Cloud: Service account JSON for Drive & Sheets access (required)
- Create a project in Google Cloud Console
- Enable Google Drive API and Google Sheets API
- Create a Service Account with the following roles:
- Google Drive: Viewer or Editor
- Google Sheets: Viewer or Editor
- Download the JSON credentials file (keep this secure!)
- Share your Google Drive folders and Sheets with the service account email address
- The service account email looks like: your-account@your-project.iam.gserviceaccount.com
- Give it at least "Viewer" permission on files you want to process
Configure the following in Replit Secrets or create a .env file:
APP_PASSWORD=your_password
LLAMA_CLOUD_API_KEY=your_key
PINECONE_API_KEY=your_key
OPENAI_API_KEY=your_key
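As an illustration of how these variables might be read at startup, here is a minimal sketch using only the standard library. The function name `load_settings` and the choice to treat `OPENAI_API_KEY` as the only optional key are assumptions based on the list above, not the app's actual code.

```python
import os

# Assumption: these three are required; OPENAI_API_KEY is only needed
# when OpenAI embeddings are selected.
REQUIRED_KEYS = ("APP_PASSWORD", "LLAMA_CLOUD_API_KEY", "PINECONE_API_KEY")
OPTIONAL_KEYS = ("OPENAI_API_KEY",)


def load_settings(env=None):
    """Collect settings from a mapping (defaults to os.environ).

    Raises RuntimeError naming every missing required key, so a
    misconfigured deployment fails early with one clear message.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    for key in OPTIONAL_KEYS:
        # Empty strings are treated the same as unset.
        settings[key] = env.get(key) or None
    return settings
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.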
Your Google Sheet should have:
- A column with Google Drive PDF links
- Columns for metadata/tags you want to attach to chunks
- Optional: A column to specify Pinecone namespaces
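To make the sheet layout concrete, the sketch below shows one way a row could be split into a Drive file ID, an optional namespace, and metadata. The column names `pdf_link` and `namespace` and both function names are hypothetical; the app lets you choose the actual columns.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches the ID segment in links such as
# https://drive.google.com/file/d/<ID>/view
_PATH_ID = re.compile(r"/d/([A-Za-z0-9_-]+)")


def drive_file_id(link):
    """Return the file ID from a Google Drive link, or None if not found."""
    parsed = urlparse(link.strip())
    match = _PATH_ID.search(parsed.path)
    if match:
        return match.group(1)
    # Fallback for links of the form ...?id=<ID>
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None


def split_row(row, link_column="pdf_link", namespace_column="namespace"):
    """Split one sheet row (a dict) into (file_id, namespace, metadata).

    Every non-empty column other than the link and namespace columns is
    kept as metadata to attach to the document's chunks.
    """
    file_id = drive_file_id(row.get(link_column, ""))
    namespace = row.get(namespace_column) or None
    metadata = {
        key: value
        for key, value in row.items()
        if key not in (link_column, namespace_column) and value not in ("", None)
    }
    return file_id, namespace, metadata
```

For example, a row with `pdf_link`, `namespace`, `author`, and `year` columns would yield the file ID, the namespace, and a metadata dict holding `author` and `year`.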
- Configure API Keys: Enter your API keys in the Configuration tab
- Upload Google Credentials: Upload your service account JSON file
- Load Sheet Data: Enter your Google Sheets URL and load the data
- Select PDFs: Choose which PDFs to process
- Configure Processing: Set LlamaParse, chunking, and embedding options
- Process: Start the pipeline and monitor progress
- Review Status: Check processing results and statistics
- Token-based: Fixed token size with overlap
- Sentence-based: Splits on sentence boundaries
- Semantic: AI-powered semantic boundary detection
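The token-based strategy can be illustrated with a small standalone sketch. It operates on an already tokenized list, so it is not tied to any tokenizer; the function name `chunk_tokens` and the default sizes are assumptions for illustration, and the app itself relies on LlamaIndex for chunking.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Each window starts `chunk_size - overlap` tokens after the previous
    one, so the tail of one chunk is repeated at the head of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input, so no
        # trailing chunk consists only of already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Splitting text on whitespace (`text.split()`) is a rough stand-in for a real tokenizer when trying this out; actual token counts depend on the embedding model.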
- OpenAI: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- HuggingFace: BAAI/bge models, sentence-transformers
- Parsing mode (auto, fast, premium)
- Result type (markdown, text)
- Language settings
- Multimodal model support
- Custom page separators
PDF (Google Drive)
↓
LlamaParse (parsing)
↓
LlamaIndex (chunking)
↓
Embedding Models (vectorization)
↓
Metadata Enrichment (from Google Sheets)
↓
Pinecone (storage)
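The flow above can be sketched as a single function that takes each stage as a callable. This is only an outline of the data flow: `run_pipeline`, its parameters, and the `id`/`values`/`metadata` record shape are assumptions for illustration, and the real stages (LlamaParse, LlamaIndex, the embedding model, Pinecone) are not called here.

```python
def run_pipeline(doc_id, pdf_bytes, row_metadata, parse, chunk, embed):
    """Run parse -> chunk -> embed, then enrich each chunk with sheet metadata.

    parse(pdf_bytes) -> str, chunk(text) -> list of str, and
    embed(chunks) -> list of vectors are supplied by the caller.
    Returns one record per chunk, ready to hand to a vector store.
    """
    text = parse(pdf_bytes)
    chunks = chunk(text)
    vectors = embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embed() must return one vector per chunk")
    records = []
    for index, (chunk_text, vector) in enumerate(zip(chunks, vectors)):
        # Copy the sheet metadata so each chunk gets its own dict.
        metadata = dict(row_metadata)
        metadata["text"] = chunk_text
        metadata["chunk_index"] = index
        records.append(
            {"id": f"{doc_id}-{index}", "values": list(vector), "metadata": metadata}
        )
    return records
```

Keeping the stages as parameters means each one can be swapped or stubbed out independently, which is useful when testing the metadata step without calling any external service.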
- Password-protected access
- API keys stored securely in environment variables
- Service account credentials handled securely
- No credentials stored in code
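As a sketch of how a simple password check can avoid leaking information through timing, the snippet below compares the submitted value against the configured one with `hmac.compare_digest`. The function name `password_matches` is hypothetical, and whether the app performs its check this way is an assumption.

```python
import hmac


def password_matches(submitted, expected):
    """Compare two passwords in constant time.

    Returns False when no password is configured, so an empty
    APP_PASSWORD never grants access.
    """
    if not expected:
        return False
    return hmac.compare_digest(
        submitted.encode("utf-8"), expected.encode("utf-8")
    )
```

In the app, `expected` would come from the `APP_PASSWORD` setting and `submitted` from the login form.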
For issues or questions, check the documentation for: