PDF Chunking & Embedding Pipeline

A Streamlit application that parses PDFs from Google Drive with LlamaParse, chunks and embeds them with LlamaIndex, and stores the resulting vectors in Pinecone.

Features

Password Protection: Simple password-based access control
LlamaParse Integration: High-quality PDF parsing with configurable parameters
Flexible Chunking: Multiple strategies (Token-based, Sentence-based, Semantic)
Multiple Embedding Models: Support for OpenAI and HuggingFace models
Metadata Enrichment: Automatic metadata mapping from Google Sheets
Pinecone Storage: Organized storage with projects, indices, and namespaces
Batch Processing: Process multiple PDFs from a single Google Sheets configuration

Setup

1. API Keys Required

You'll need API keys for the following services:

LlamaParse: Get from cloud.llamaindex.ai
Pinecone: Get from app.pinecone.io
OpenAI (optional): Get from platform.openai.com; only needed if you use OpenAI embedding models
Google Cloud: Service account JSON for Drive & Sheets access (required)

2. Google Cloud Setup (Required)

Create a project in Google Cloud Console
Enable Google Drive API and Google Sheets API
Create a Service Account with the following roles:
- Google Drive: Viewer or Editor
- Google Sheets: Viewer or Editor
Download the JSON credentials file (keep this secure!)
Share your Google Drive folders and Sheets with the service account email address
- The service account email looks like: your-account@your-project.iam.gserviceaccount.com
- Give it at least "Viewer" permission on files you want to process
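
Before sharing folders, it can help to confirm that the downloaded file really is a service account key and to read the email address to share with. The following is a minimal sketch using only the standard library; the function name `service_account_email` and the sample values are illustrative and are not part of this application.

```python
import json

# Fields that a service account key file is expected to contain.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def service_account_email(raw_json: str) -> str:
    """Return the client_email of a service account key, or raise ValueError."""
    info = json.loads(raw_json)
    if not isinstance(info, dict):
        raise ValueError("credentials file must contain a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not info.get(field)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    if info["type"] != "service_account":
        raise ValueError(f"expected type 'service_account', got {info['type']!r}")
    return info["client_email"]


# Fake key for illustration only; a real file holds an actual private key.
sample_key = json.dumps({
    "type": "service_account",
    "client_email": "your-account@your-project.iam.gserviceaccount.com",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
})
print(service_account_email(sample_key))
```

The printed address is the one to add under "Share" on each Drive folder and Sheet.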

3. Environment Variables

Configure the following in Replit Secrets or create a .env file:

APP_PASSWORD=your_password
LLAMA_CLOUD_API_KEY=your_key
PINECONE_API_KEY=your_key
OPENAI_API_KEY=your_key
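
As an illustration of how these variables might be read at startup, here is a small sketch that fails fast when a required value is missing. It assumes `OPENAI_API_KEY` is optional (as noted under API keys) and that the other three are required; `load_settings` is a hypothetical helper, not a function from this repository.

```python
import os
from typing import Dict, Mapping

REQUIRED = ("APP_PASSWORD", "LLAMA_CLOUD_API_KEY", "PINECONE_API_KEY")
OPTIONAL = ("OPENAI_API_KEY",)  # only needed for OpenAI embeddings


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings from the environment, raising if a required one is unset."""
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name].strip() for name in REQUIRED}
    for name in OPTIONAL:
        value = env.get(name, "").strip()
        if value:
            settings[name] = value
    return settings
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests without touching the real environment.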

4. Google Sheets Format

Your Google Sheet should have:

A column with Google Drive PDF links
Columns for metadata/tags you want to attach to chunks
Optional: A column to specify Pinecone namespaces
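
To make the sheet layout concrete, the sketch below shows one way a row could be split into a Drive file ID, an optional namespace, and chunk metadata. The column names `pdf_link` and `namespace` are assumptions for illustration (the application lets you choose your own columns), and the helper names are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Matches share links of the form https://drive.google.com/file/d/<ID>/view
_FILE_PATH = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_file_id(link: str) -> Optional[str]:
    """Extract the file ID from a Google Drive link, or return None."""
    parsed = urlparse(link.strip())
    match = _FILE_PATH.search(parsed.path)
    if match:
        return match.group(1)
    # Fall back to links of the form https://drive.google.com/open?id=<ID>
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None


def split_row(
    row: Dict[str, str],
    link_col: str = "pdf_link",
    namespace_col: str = "namespace",
) -> Tuple[Optional[str], Optional[str], Dict[str, str]]:
    """Split one sheet row into (file_id, namespace, metadata)."""
    file_id = drive_file_id(row.get(link_col, ""))
    namespace = row.get(namespace_col, "").strip() or None
    metadata = {
        key: value
        for key, value in row.items()
        if key not in (link_col, namespace_col) and value != ""
    }
    return file_id, namespace, metadata
```

Empty cells are dropped from the metadata so that blank columns do not end up attached to every chunk.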

Usage

Configure API Keys: Enter your API keys in the Configuration tab
Upload Google Credentials: Upload your service account JSON file
Load Sheet Data: Enter your Google Sheets URL and load the data
Select PDFs: Choose which PDFs to process
Configure Processing: Set LlamaParse, chunking, and embedding options
Process: Start the pipeline and monitor progress
Review Status: Check processing results and statistics

Chunking Strategies

Token-based: Fixed token size with overlap
Sentence-based: Splits on sentence boundaries
Semantic: AI-powered semantic boundary detection
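
The first two strategies can be illustrated with a short standard-library sketch. It is a simplification under stated assumptions: whitespace-separated words stand in for real tokenizer tokens, and a regular expression stands in for proper sentence detection. The application itself relies on LlamaIndex for chunking; the functions below are hypothetical and only show the idea. Semantic chunking needs an embedding model and is not sketched here.

```python
import re
from typing import List


def token_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Fixed-size windows of whitespace tokens, each sharing `overlap` tokens
    with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Split after ., ! or ? followed by whitespace (a rough sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(text: str, max_tokens: int = 12) -> List[str]:
    """Group whole sentences into chunks of at most `max_tokens` words.
    A single sentence longer than the limit becomes its own chunk."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The trade-off is visible in the code: token windows give predictable sizes but can cut a sentence in half, while sentence grouping keeps sentences intact at the cost of uneven chunk lengths.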

Embedding Models

OpenAI: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
HuggingFace: BAAI/bge models, sentence-transformers
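
One practical consequence of the model choice is vector size: a Pinecone index has a fixed dimension, so it must match the output size of the embedding model. The lookup below is a sketch listing default output dimensions for the OpenAI models above and a few commonly used HuggingFace models; the specific HuggingFace model names are examples and may differ from the ones the application offers, and `check_index_dimension` is a hypothetical helper.

```python
# Default output dimensions of some common embedding models.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "BAAI/bge-small-en-v1.5": 384,
    "BAAI/bge-base-en-v1.5": 768,
    "BAAI/bge-large-en-v1.5": 1024,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}


def expected_dimension(model_name: str) -> int:
    """Look up the default embedding size for a model name."""
    try:
        return DEFAULT_DIMENSIONS[model_name]
    except KeyError:
        raise KeyError(
            f"unknown model {model_name!r}; add its dimension to DEFAULT_DIMENSIONS"
        ) from None


def check_index_dimension(model_name: str, index_dimension: int) -> None:
    """Raise ValueError if the index dimension does not match the model."""
    expected = expected_dimension(model_name)
    if expected != index_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
```

Running a check like this before processing avoids discovering the mismatch only when the first upsert is rejected.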

LlamaParse Parameters

Parsing mode (auto, fast, premium)
Result type (markdown, text)
Language settings
Multimodal model support
Custom page separators

Architecture

PDF (Google Drive)
    ↓
LlamaParse (parsing)
    ↓
LlamaIndex (chunking)
    ↓
Embedding Models (vectorization)
    ↓
Metadata Enrichment (from Google Sheets)
    ↓
Pinecone (storage)
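
The last two stages can be sketched without any external services. The example below assumes chunks and their vectors have already been produced, merges in the metadata from the sheet row, and builds records in the `id` / `values` / `metadata` shape that Pinecone upserts accept. The ID scheme, the extra metadata keys, and the batch size are illustrative choices, not necessarily what this application does.

```python
from typing import Dict, Iterator, List, Sequence


def build_records(
    file_id: str,
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
    sheet_metadata: Dict[str, str],
) -> List[dict]:
    """Pair each chunk with its vector and attach the sheet metadata."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    records = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        records.append({
            "id": f"{file_id}-{index}",  # hypothetical ID scheme
            "values": list(vector),
            "metadata": {
                **sheet_metadata,
                "source_file_id": file_id,
                "chunk_index": index,
                "text": chunk,
            },
        })
    return records


def batched(items: List[dict], size: int = 100) -> Iterator[List[dict]]:
    """Yield successive slices of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be passed to the Pinecone client's upsert call together with the namespace taken from the sheet; that call is left out here so the sketch runs without credentials.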

Security

Password-protected access
API keys stored securely in environment variables
Service account credentials handled securely
No credentials stored in code
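
For the password check itself, a constant-time comparison is a common precaution against timing attacks. The sketch below shows the idea with the standard library; it is an illustration, not this application's actual implementation, and `password_matches` is a hypothetical name.

```python
import hmac


def password_matches(submitted: str, expected: str) -> bool:
    """Compare passwords in constant time; an unset password never matches."""
    if not expected:
        return False  # refuse access when APP_PASSWORD is not configured
    return hmac.compare_digest(
        submitted.encode("utf-8"), expected.encode("utf-8")
    )
```

In a Streamlit page, the submitted value would typically come from a password-type text input and the expected value from the `APP_PASSWORD` environment variable.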

Support

For issues or questions, check the documentation for the services the pipeline uses: LlamaParse, LlamaIndex, and Pinecone.
