RUB RAG

Funded within the DFG Collaborative Research Centre (SFB) 1567 "Virtuelle Lebenswelten", subproject A03 "Virtuelle Environments", at Ruhr-Universität Bochum.

A powerful Retrieval Augmented Generation (RAG) chatbot application for conversational Q&A over document collections. Upload files, chat with an AI that references your documents, and view citations for every answer.

Features

  • Multi-format Document Support: PDF, DOCX, Excel, images, web pages, and archives
  • Multiple LLM Providers: OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, Cohere, Groq, and local models via Ollama
  • Advanced Indexing: Standard file indexing and Microsoft GraphRAG for knowledge graph-based retrieval
  • Intelligent Reasoning: Simple QA, question decomposition, ReACT agents, and ReWOO agents
  • Citation Support: Every answer includes references to source documents with page numbers
  • User Management: Multi-user support with private and shared document collections
  • Modern Web UI: Responsive Gradio-based interface with PDF viewer
  • Docker Ready: Single-command deployment with persistent storage

Quick Start

Using Docker (Fastest)

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Start the application
docker-compose up

# Open http://localhost:7860 in your browser

Using Installation Scripts

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Linux
bash scripts/run_linux.sh

# macOS
bash scripts/run_macos.sh

Installation

Prerequisites

  • Python: 3.10 or higher
  • Git: For cloning the repository
  • API Key: For at least one LLM provider (OpenAI is recommended for getting started)

Docker (Recommended)

Docker provides the easiest and most consistent installation experience.

Option 1: Docker Compose

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create environment file (optional)
echo "OPENAI_API_KEY=your-key-here" > .env

# Start the application
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the application
docker-compose down

Option 2: Docker Build

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Build the lite version (smaller, faster)
docker build -t rubrag:lite --target lite .

# Or build the full version (includes OCR, LibreOffice, advanced document parsing)
docker build -t rubrag:full --target full .

# Run the container
docker run -d \
  -p 7860:7860 \
  -v $(pwd)/ktem_app_data:/app/ktem_app_data \
  -e OPENAI_API_KEY="your-key-here" \
  rubrag:lite

Docker Image Variants

| Variant | Size    | Features |
| ------- | ------- | -------- |
| lite    | ~2-3 GB | Core RAG, PDF processing, all LLM providers |
| full    | ~5-7 GB | Everything in lite, plus OCR (Tesseract), LibreOffice, FFmpeg, and advanced document parsing |

Linux

The installation script handles everything automatically:

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Run the installation script
bash scripts/run_linux.sh

What the script does:

  1. Installs Miniconda (if not present)
  2. Creates an isolated Python 3.10 environment
  3. Installs all dependencies
  4. Downloads PDF.js for document viewing
  5. Optionally configures local LLM support
  6. Launches the web interface

Supported architectures: x86_64 (Intel/AMD), aarch64 (ARM64)

macOS

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Run the installation script
bash scripts/run_macos.sh

Supported architectures: Intel (x86_64), Apple Silicon (ARM64)

Manual Installation

For advanced users who want full control over the installation:

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create and activate a virtual environment
python3.10 -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install the core libraries
pip install -e libs/kotaemon
pip install -e libs/ktem

# Install PDF services (optional, for enhanced PDF processing)
pip install "pdfservices-sdk @ git+https://github.com/niallcm/pdfservices-python-sdk.git@main"

# Download PDF.js viewer
bash scripts/download_pdfjs.sh libs/ktem/ktem/assets/prebuilt

# Set environment variables
export OPENAI_API_KEY="your-api-key-here"

# Run the application
python app.py

Installing Advanced Dependencies

For full document processing capabilities:

# OCR support
pip install "kotaemon[adv]"

# Or install specific components
pip install fastembed sentence-transformers

# For Tesseract OCR (system package)
# Ubuntu/Debian: sudo apt install tesseract-ocr
# macOS: brew install tesseract

Configuration

All configuration is managed through flowsettings.py and environment variables.

LLM Providers

RUB RAG supports multiple LLM providers. Configure them via environment variables:

OpenAI (Default)

export OPENAI_API_KEY="sk-..."
export OPENAI_CHAT_MODEL="gpt-4o"  # Optional, default: gpt-3.5-turbo
export OPENAI_API_BASE="https://api.openai.com/v1"  # Optional, for custom endpoints

Azure OpenAI

export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_CHAT_DEPLOYMENT="gpt-4"
export AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT="text-embedding-ada-002"
export OPENAI_API_VERSION="2024-02-15-preview"

Anthropic Claude

export ANTHROPIC_API_KEY="sk-ant-..."

Google Gemini

export GOOGLE_API_KEY="your-google-key"

Groq

export GROQ_API_KEY="your-groq-key"

Cohere

export COHERE_API_KEY="your-cohere-key"

Ollama (Local)

# No API key needed - Ollama runs locally
# Ensure Ollama is running: ollama serve
export OLLAMA_BASE_URL="http://localhost:11434/v1"
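
All of these variables can also live in a .env file rather than being exported per shell. Whether app.py loads .env on its own is not guaranteed, so here is a minimal sketch that loads it explicitly using the python-dotenv package (a hypothetical helper, not part of the repo):

# load_env.py -- hypothetical helper, not part of the repo
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the current directory into os.environ

# Fail fast if no provider key is configured
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; add it to .env or export it")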

Embedding Models

Configure embedding models in flowsettings.py:

KH_EMBEDDINGS = {
    "openai": {
        "spec": {
            "__type__": "kotaemon.embeddings.OpenAIEmbeddings",
            "model": "text-embedding-ada-002",
        },
        "default": True,
    },
    # Add more embedding providers as needed
}
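
Additional providers follow the same spec pattern. As a hedged example, here is a local FastEmbed entry added alongside OpenAI; the class path and parameter name are assumptions, so verify them in libs/kotaemon/kotaemon/embeddings before use:

KH_EMBEDDINGS = {
    "openai": {
        "spec": {
            "__type__": "kotaemon.embeddings.OpenAIEmbeddings",
            "model": "text-embedding-ada-002",
        },
        "default": True,
    },
    # Local, API-key-free alternative (requires: pip install fastembed).
    # Class path and "model_name" are assumptions -- check libs/kotaemon.
    "fastembed": {
        "spec": {
            "__type__": "kotaemon.embeddings.FastEmbedEmbeddings",
            "model_name": "BAAI/bge-small-en-v1.5",
        },
        "default": False,
    },
}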

Vector Stores

The default configuration uses ChromaDB for vector storage and LanceDB for document storage:

# In flowsettings.py
KH_VECTORSTORE = {
    "__type__": "kotaemon.storages.ChromaVectorStore",
    "path": str(KH_USER_DATA_DIR / "vectorstore"),
}

KH_DOCSTORE = {
    "__type__": "kotaemon.storages.LanceDBDocumentStore",
    "path": str(KH_USER_DATA_DIR / "docstore"),
}
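
The __type__ key in these dicts is a dotted import path that the application resolves to a class at startup, passing the remaining keys as constructor arguments. A minimal illustration of that pattern (ktem's actual loader may differ in details):

import importlib

def build_from_spec(spec: dict):
    """Instantiate a class from a {"__type__": "pkg.mod.Class", ...} dict."""
    spec = dict(spec)  # avoid mutating the caller's config
    module_path, _, class_name = spec.pop("__type__").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**spec)  # remaining keys become constructor kwargs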

User Management

User management is enabled by default:

# Enable/disable user management
export KH_FEATURE_USER_MANAGEMENT="true"

# Set admin credentials
export KH_FEATURE_USER_MANAGEMENT_ADMIN="admin"
export KH_FEATURE_USER_MANAGEMENT_PASSWORD="your-secure-password"

Default credentials:

  • Username: admin
  • Password: admin

Security Note: Change the default admin password before deploying to production!


Usage

Web Interface

After starting the application, open http://localhost:7860 in your browser.

Uploading Documents

  1. Navigate to the Files tab

  2. Click Upload or drag and drop files

  3. Supported formats:

    • PDF documents
    • Microsoft Word (.docx)
    • Excel spreadsheets (.xlsx)
    • Plain text files
    • Images (with OCR in full version)
    • Web URLs
    • ZIP archives
  4. Wait for indexing to complete (status shown in UI)

Chat Features

  1. Ask Questions: Type your question in the chat box
  2. View Citations: Click on references to see source documents
  3. PDF Viewer: View highlighted passages in the built-in PDF viewer
  4. Conversation History: Access previous conversations from the sidebar
  5. Follow-up Questions: Continue the conversation with context preserved

Reasoning Pipelines

Select different reasoning strategies based on your needs:

| Pipeline  | Best For          | Description |
| --------- | ----------------- | ----------- |
| Simple    | Quick answers     | Direct retrieval and response |
| Full QA   | Detailed answers  | Enhanced context processing |
| Decompose | Complex questions | Breaks the question into sub-questions |
| ReACT     | Multi-step tasks  | Reasoning + acting agent |
| ReWOO     | Planning tasks    | Plan-then-execute approach |

Change the reasoning pipeline in the Settings tab.
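
To make the Decompose strategy concrete, here is a minimal sketch of the idea, not the repo's actual implementation; ask_llm and retrieve are hypothetical stand-ins for an LLM call and the document retriever:

from typing import Callable

def decompose_answer(
    question: str,
    ask_llm: Callable[[str], str],   # hypothetical: prompt -> completion
    retrieve: Callable[[str], str],  # hypothetical: query -> retrieved context
) -> str:
    # 1. Split the complex question into simpler sub-questions.
    raw = ask_llm(f"Split into sub-questions, one per line:\n{question}")
    subs = [line for line in raw.splitlines() if line.strip()]
    # 2. Answer each sub-question against its own retrieved context.
    partials = [ask_llm(f"Context: {retrieve(q)}\nQuestion: {q}") for q in subs]
    # 3. Synthesize a single final answer from the partial answers.
    return ask_llm(f"Combine into one answer to '{question}':\n" + "\n".join(partials))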


Advanced Configuration

GraphRAG Integration

GraphRAG provides knowledge graph-based retrieval for better understanding of entity relationships.

  1. Enable GraphRAG in flowsettings.py:
KH_INDICES = [
    {
        "name": "File Index",
        "config": {...},
        "default": True,
    },
    {
        "name": "GraphRAG Index",
        "config": {
            "__type__": "ktem.index.file.graph.GraphRAGIndex",
            "doc_store": KH_DOCSTORE,
        },
    },
]
  2. Configure GraphRAG settings in settings.yaml:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat
  model: gpt-4

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-ada-002
  3. Set the environment variables:
export GRAPHRAG_API_KEY="your-api-key"
export USE_CUSTOMIZED_GRAPHRAG_SETTING="true"
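
Before kicking off an indexing run, it is worth a quick sanity check that settings.yaml parses and the key is actually set; a sketch assuming PyYAML is installed:

import os

import yaml  # pip install pyyaml

with open("settings.yaml") as f:
    cfg = yaml.safe_load(f)

# The ${GRAPHRAG_API_KEY} placeholder is only substituted at runtime,
# so the config is useless unless the variable is set in the environment.
assert os.environ.get("GRAPHRAG_API_KEY"), "GRAPHRAG_API_KEY is not set"
print("LLM type:", cfg["llm"]["type"])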

Local LLM Support

Run RUB RAG completely offline with local models:

Using Ollama

  1. Install Ollama: https://ollama.ai

  2. Pull a model:

ollama pull llama3.2
ollama pull nomic-embed-text
  3. Configure in flowsettings.py (already included):
KH_LLMS = {
    "ollama": {
        "spec": {
            "__type__": "kotaemon.llms.ChatOpenAI",
            "base_url": "http://localhost:11434/v1",
            "model": "llama3.2",
            "api_key": "ollama",
        },
    },
}
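
Because the entry above talks to Ollama through its OpenAI-compatible endpoint, you can verify connectivity with the openai client before restarting the app; a minimal sketch, assuming llama3.2 has been pulled:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(reply.choices[0].message.content)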

Using llama.cpp

For GGUF model files:

# Set the model path
export LOCAL_MODEL="/path/to/model.gguf"

# Start the server
bash scripts/server_llamacpp_linux.sh  # or server_llamacpp_macos.sh

Custom Pipelines

Create custom reasoning pipelines by extending BaseComponent:

# In your custom module
from kotaemon.base import BaseComponent

class MyCustomPipeline(BaseComponent):
    """Custom reasoning pipeline"""

    def run(self, question: str, history: list, **kwargs):
        # Your custom logic here
        return {"answer": "...", "citations": [...]}

Register in flowsettings.py:

KH_REASONINGS = [
    "ktem.reasoning.simple.FullQAPipeline",
    "your_module.MyCustomPipeline",  # Add your pipeline
]
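
Before wiring it into the UI, the pipeline can be smoke-tested directly; a hypothetical check, assuming the class needs no constructor arguments:

pipeline = MyCustomPipeline()
result = pipeline.run("What does the corpus say about X?", history=[])
print(result["answer"])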

See docs/pages/app/customize-flows.md for detailed documentation.


Architecture

rubrag/
├── app.py                 # Application entry point
├── flowsettings.py        # Central configuration
├── settings.yaml          # GraphRAG configuration
├── docker-compose.yml     # Docker orchestration
├── Dockerfile             # Multi-stage Docker build
│
├── libs/
│   ├── kotaemon/          # Core RAG library
│   │   ├── agents/        # ReACT, ReWOO agents
│   │   ├── embeddings/    # Embedding model interfaces
│   │   ├── llms/          # LLM provider integrations
│   │   ├── loaders/       # Document loaders
│   │   ├── indices/       # Indexing and retrieval
│   │   └── storages/      # Vector and document stores
│   │
│   └── ktem/              # Application framework
│       ├── pages/         # UI pages (chat, settings, login)
│       ├── reasoning/     # Reasoning pipelines
│       ├── index/         # Indexing system
│       ├── db/            # Database models
│       └── assets/        # CSS, JS, images
│
├── scripts/               # Installation and utility scripts
├── templates/             # Customization templates
└── docs/                  # Documentation

Key Components

| Component       | Purpose |
| --------------- | ------- |
| kotaemon        | Core RAG framework with LLM, embedding, and storage abstractions |
| ktem            | Application layer with the Gradio UI, user management, and pipelines |
| flowsettings.py | Configuration hub for all components |

Data Storage

| Data Type       | Location                             | Purpose |
| --------------- | ------------------------------------ | ------- |
| SQLite database | ktem_app_data/user_data/sql.db       | Users, conversations, file metadata |
| Vector store    | ktem_app_data/user_data/vectorstore/ | Document embeddings (ChromaDB) |
| Document store  | ktem_app_data/user_data/docstore/    | Document content (LanceDB) |
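
To peek at what the application has stored, the SQLite database can be inspected with Python's standard library; a sketch assuming the default ktem_app_data location:

import sqlite3

con = sqlite3.connect("ktem_app_data/user_data/sql.db")
# List the tables that hold users, conversations, and file metadata
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])
con.close()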

Troubleshooting

Common Issues

"No module named 'kotaemon'"

# Ensure you're in the virtual environment
source venv/bin/activate

# Reinstall the libraries
pip install -e libs/kotaemon
pip install -e libs/ktem

"OPENAI_API_KEY not set"

# Set the environment variable
export OPENAI_API_KEY="sk-..."

# Or create a .env file
echo "OPENAI_API_KEY=sk-..." > .env

Docker container won't start

# Check logs
docker-compose logs

# Ensure port 7860 is available
lsof -i :7860

# Restart with fresh state
docker-compose down -v
docker-compose up --build

PDF viewer not working

# Download PDF.js manually
bash scripts/download_pdfjs.sh libs/ktem/ktem/assets/prebuilt

Out of memory errors

  • Reduce chunk_size in settings
  • Use a smaller embedding model
  • Enable document pagination
  • Increase Docker memory limit

Getting Help

If you run into a problem that isn't covered here, please open an issue on the GitHub repository.

Development

Setting Up Development Environment

# Clone and enter the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e libs/kotaemon
pip install -e libs/ktem

# Install development dependencies
pip install black isort flake8 mypy pytest

# Set up pre-commit hooks
pip install pre-commit
pre-commit install

Code Quality

The project uses several tools for code quality:

# Format code
black .
isort .

# Lint
flake8 .

# Type checking
mypy libs/kotaemon libs/ktem

# Run tests
pytest libs/ktem/ktem_tests
pytest libs/kotaemon/tests

Commit Standards

Commits follow the Conventional Commits specification:

feat: add new reasoning pipeline
fix: resolve PDF rendering issue
docs: update installation guide
refactor: simplify embedding interface

Valid types: build, chore, ci, docs, feat, fix, perf, refactor, revert, style, test
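
A commit-msg hook enforcing this format could look like the following sketch; the repo's actual pre-commit configuration may differ, and the regex simply encodes the types listed above:

# check_commit_msg.py -- hypothetical hook script
import re
import sys

TYPES = "build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test"
PATTERN = re.compile(rf"^({TYPES})(\([\w.-]+\))?!?: .+")

with open(sys.argv[1]) as f:  # pre-commit passes the message file path
    first_line = f.readline().strip()

if not PATTERN.match(first_line):
    sys.exit(f"Not a Conventional Commit message: {first_line!r}")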


License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

  • @trducng
  • @lone17
  • @taprosoft
  • @cin-albert

Acknowledgments

Funded within the DFG Collaborative Research Centre (SFB) 1567 "Virtuelle Lebenswelten", subproject A03 "Virtuelle Environments", at Ruhr-Universität Bochum.

Built with the kotaemon RAG framework and a Gradio-based web interface.
