RAG System with JSON Aggregation

Overview

This system integrates a Retrieval-Augmented Generation (RAG) architecture with JSON aggregation, enabling efficient document retrieval and structured data processing.

Architecture

Core Components

Document Ingestion (DocumentIngestor)
- Supports multiple file formats (PDF, DOCX, JSON, TXT)
- Extracts content and metadata
- Chunks content for non-JSON files
- Generates embeddings using SentenceTransformer
RAG System (RAGSystem)
- Implements semantic search functionality
- Uses vector similarity for document retrieval
- Returns ranked results with relevance scores
JSON Aggregator (JSONAggregator)
- Performs aggregation operations on JSON fields
- Supports nested JSON traversal
- Provides multiple aggregation types (COUNT, SUM, MEAN, etc.)
Embedding Generator (EmbeddingGenerator)
- Uses sentence-transformers/all-MiniLM-L6-v2
- Generates vector embeddings for text

Data Flow

Documents are uploaded through the API.
Content is processed and stored in Weaviate.
Queries can be performed using semantic search.
JSON aggregations can be performed on stored documents.

Document Processing Details

PDF Processing Capabilities

The system handles both text-based and image-based PDFs through a sophisticated dual-processing approach:

Text Extraction

Primary Method: Uses pdfplumber for text-based PDFs
Fallback Method: Automatically switches to OCR if:
- Text extraction fails
- Extracted text is too short (<50 characters)
- PDF contains primarily images

OCR Processing

Uses pytesseract and pdf2image for image-based PDFs
Converts PDF pages to high-resolution images
Performs OCR on each page individually
Combines results into a single searchable text

Prerequisites for OCR

System Dependencies

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr \
    poppler-utils

# MacOS
brew install tesseract
brew install poppler

Python Dependencies

pip install pytesseract pdf2image Pillow

Installation

Prerequisites

Python 3.8+
pip package manager
Docker (for Weaviate)

Setup

Clone the repository

git clone git@github.com:Bl4ck-h00d/RAG.git
cd RAG

Create virtual environment:

python3 -m venv venv
source venv/bin/activate

Install dependencies

pip install fastapi uvicorn weaviate-client openai pdfplumber python-docx pydantic python-multipart torch sentence_transformers pdf2image Pillow pytesseract

Start Weaviate

docker-compose up -d

Run the ingestion and search services

python3 -m app.main

The application will be available at http://localhost:8000

API Documentation

The API provides the following endpoints:

Upload Documents

URL: POST /upload

curl -X POST -F "file=@/path/to/your/document.pdf" http://51.20.182.187:8000/upload

Query Documents

URL: POST /query

curl -X POST -H "Content-Type: application/json" \
  -d '{"query": "your search query", "limit": 5}' \
  http://51.20.182.187:8000/query

Aggregate JSON Field

Note

doc_id is required for aggregation operations to specify a particular document
URL: GET /aggregate/{field_path}

# Count aggregation
curl "http://51.20.182.187:8000/aggregate/json.customer_id?doc_id=<doc_id>&operation=count"

# Text occurrences
curl "http://51.20.182.187:8000/aggregate/json.membership?doc_id=<doc_id>&operation=text_occurrences"

# Numeric operations
curl "http://51.20.182.187:8000/aggregate/json.total_spent?doc_id=<doc_id>&operation=sum"
curl "http://51.20.182.187:8000/aggregate/json.total_spent?doc_id=<doc_id>&operation=mean"
curl "http://51.20.182.187:8000/aggregate/json.total_spent?doc_id=<doc_id>&operation=median"
curl "http://51.20.182.187:8000/aggregate/json.total_spent?doc_id=<doc_id>&operation=min"
curl "http://51.20.182.187:8000/aggregate/json.total_spent?doc_id=<doc_id>&operation=max"

WIP: Aggregation with Query Text

curl "http://localhost:8000/aggregate/json.membership?doc_id=<doc_id>&operation=count&query_text=\"Gold\""

This should match the query_text with the values in the json field and return the count of the matching values.

Supported Operations for JSON Aggregation

COUNT
SUM
MEAN
MODE
MEDIAN
MIN
MAX
TEXT_OCCURRENCES

JSON Field Path Notation

Simple paths: json.field1
Array access: json.field1[].field2
Nested arrays: json.field1[].field2[].field3
Nested objects: json.field1.field2.field3

The architecture is designed to be modular and extensible, with clear separation of concerns between document processing, vector search, and aggregation functionalities.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
app		app
.gitignore		.gitignore
Dockerfile		Dockerfile
docker-compose.yml		docker-compose.yml
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG System with JSON Aggregation

Overview

Architecture

Core Components

Data Flow

Document Processing Details

PDF Processing Capabilities

Text Extraction

OCR Processing

Prerequisites for OCR

System Dependencies

Python Dependencies

Installation

Prerequisites

Setup

Clone the repository

Create virtual environment:

Install dependencies

Start Weaviate

Run the ingestion and search services

API Documentation

Upload Documents

Query Documents

Aggregate JSON Field

Note

WIP: Aggregation with Query Text

Supported Operations for JSON Aggregation

JSON Field Path Notation

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Bl4ck-h00d/RAG

Folders and files

Latest commit

History

Repository files navigation

RAG System with JSON Aggregation

Overview

Architecture

Core Components

Data Flow

Document Processing Details

PDF Processing Capabilities

Text Extraction

OCR Processing

Prerequisites for OCR

System Dependencies

Python Dependencies

Installation

Prerequisites

Setup

Clone the repository

Create virtual environment:

Install dependencies

Start Weaviate

Run the ingestion and search services

API Documentation

Upload Documents

Query Documents

Aggregate JSON Field

Note

WIP: Aggregation with Query Text

Supported Operations for JSON Aggregation

JSON Field Path Notation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages