RUB RAG

Funded within the DFG Collaborative Research Centre (SFB) 1567 "Virtuelle Lebenswelten", subproject A03 "Virtuelle Environments", at Ruhr-Universität Bochum.

A powerful Retrieval Augmented Generation (RAG) chatbot application for conversational Q&A over document collections. Upload files, chat with an AI that references your documents, and view citations for every answer.

Features

  • Multi-format Document Support: PDF, DOCX, Excel, images, web pages, and archives
  • Multiple LLM Providers: OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, Cohere, Groq, and local models via Ollama
  • Advanced Indexing: Standard file indexing and Microsoft GraphRAG for knowledge graph-based retrieval
  • Intelligent Reasoning: Simple QA, question decomposition, ReACT agents, and ReWOO agents
  • Citation Support: Every answer includes references to source documents with page numbers
  • User Management: Multi-user support with private and shared document collections
  • Modern Web UI: Responsive Gradio-based interface with PDF viewer
  • Docker Ready: Single-command deployment with persistent storage

Quick Start

Using Docker (Fastest)

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Start the application
docker-compose up

# Open http://localhost:7860 in your browser

Using Installation Scripts

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Linux
bash scripts/run_linux.sh

# macOS
bash scripts/run_macos.sh

Installation

Prerequisites

  • Python: 3.10 or higher
  • Git: For cloning the repository
  • API Key: For at least one LLM provider (OpenAI is recommended for getting started)

Docker (Recommended)

Docker provides the easiest and most consistent installation experience.

Option 1: Docker Compose

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create environment file (optional)
echo "OPENAI_API_KEY=your-key-here" > .env

# Start the application
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the application
docker-compose down

Option 2: Docker Build

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Build the lite version (smaller, faster)
docker build -t rubrag:lite --target lite .

# Or build the full version (includes OCR, LibreOffice, advanced document parsing)
docker build -t rubrag:full --target full .

# Run the container
docker run -d \
  -p 7860:7860 \
  -v $(pwd)/ktem_app_data:/app/ktem_app_data \
  -e OPENAI_API_KEY="your-key-here" \
  rubrag:lite

Docker Image Variants

| Variant | Size    | Features |
| ------- | ------- | -------- |
| lite    | ~2-3 GB | Core RAG, PDF processing, all LLM providers |
| full    | ~5-7 GB | Everything in lite, plus OCR (Tesseract), LibreOffice, FFmpeg, and advanced document parsing |

Linux

The installation script handles everything automatically:

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Run the installation script
bash scripts/run_linux.sh

What the script does:

  1. Installs Miniconda (if not present)
  2. Creates an isolated Python 3.10 environment
  3. Installs all dependencies
  4. Downloads PDF.js for document viewing
  5. Optionally configures local LLM support
  6. Launches the web interface

Supported architectures: x86_64 (Intel/AMD), aarch64 (ARM64)

macOS

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Run the installation script
bash scripts/run_macos.sh

Supported architectures: Intel (x86_64), Apple Silicon (ARM64)

Manual Installation

For advanced users who want full control over the installation:

# Clone the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create and activate a virtual environment
python3.10 -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install the core libraries
pip install -e libs/kotaemon
pip install -e libs/ktem

# Install PDF services (optional, for enhanced PDF processing)
pip install "pdfservices-sdk @ git+https://github.com/niallcm/pdfservices-python-sdk.git@main"

# Download PDF.js viewer
bash scripts/download_pdfjs.sh libs/ktem/ktem/assets/prebuilt

# Set environment variables
export OPENAI_API_KEY="your-api-key-here"

# Run the application
python app.py

Installing Advanced Dependencies

For full document processing capabilities:

# OCR support
pip install "kotaemon[adv]"

# Or install specific components
pip install fastembed sentence-transformers

# For Tesseract OCR (system package)
# Ubuntu/Debian: sudo apt install tesseract-ocr
# macOS: brew install tesseract

Configuration

All configuration is managed through flowsettings.py and environment variables.

LLM Providers

RUB RAG supports multiple LLM providers. Configure them via environment variables:

OpenAI (Default)

export OPENAI_API_KEY="sk-..."
export OPENAI_CHAT_MODEL="gpt-4o"  # Optional, default: gpt-3.5-turbo
export OPENAI_API_BASE="https://api.openai.com/v1"  # Optional, for custom endpoints

Azure OpenAI

export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_CHAT_DEPLOYMENT="gpt-4"
export AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT="text-embedding-ada-002"
export OPENAI_API_VERSION="2024-02-15-preview"

Anthropic Claude

export ANTHROPIC_API_KEY="sk-ant-..."

Google Gemini

export GOOGLE_API_KEY="your-google-key"

Groq

export GROQ_API_KEY="your-groq-key"

Cohere

export COHERE_API_KEY="your-cohere-key"

Ollama (Local)

# No API key needed - Ollama runs locally
# Ensure Ollama is running: ollama serve
export OLLAMA_BASE_URL="http://localhost:11434/v1"
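
All of these variables can also live in a .env file rather than being exported per shell. Whether app.py loads .env on its own is not guaranteed, so here is a minimal sketch that loads it explicitly using the python-dotenv package (a hypothetical helper, not part of the repo):

# load_env.py -- hypothetical helper, not part of the repo
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the current directory into os.environ

# Fail fast if no provider key is configured
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; add it to .env or export it")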

Embedding Models

Configure embedding models in flowsettings.py:

KH_EMBEDDINGS = {
    "openai": {
        "spec": {
            "__type__": "kotaemon.embeddings.OpenAIEmbeddings",
            "model": "text-embedding-ada-002",
        },
        "default": True,
    },
    # Add more embedding providers as needed
}
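
Additional providers follow the same spec pattern. As a hedged example, here is a local FastEmbed entry added alongside OpenAI; the class path and parameter name are assumptions, so verify them in libs/kotaemon/kotaemon/embeddings before use:

KH_EMBEDDINGS = {
    "openai": {
        "spec": {
            "__type__": "kotaemon.embeddings.OpenAIEmbeddings",
            "model": "text-embedding-ada-002",
        },
        "default": True,
    },
    # Local, API-key-free alternative (requires: pip install fastembed).
    # Class path and "model_name" are assumptions -- check libs/kotaemon.
    "fastembed": {
        "spec": {
            "__type__": "kotaemon.embeddings.FastEmbedEmbeddings",
            "model_name": "BAAI/bge-small-en-v1.5",
        },
        "default": False,
    },
}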

Vector Stores

The default configuration uses ChromaDB for vector storage and LanceDB for document storage:

# In flowsettings.py
KH_VECTORSTORE = {
    "__type__": "kotaemon.storages.ChromaVectorStore",
    "path": str(KH_USER_DATA_DIR / "vectorstore"),
}

KH_DOCSTORE = {
    "__type__": "kotaemon.storages.LanceDBDocumentStore",
    "path": str(KH_USER_DATA_DIR / "docstore"),
}
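
The __type__ key in these dicts is a dotted import path that the application resolves to a class at startup, passing the remaining keys as constructor arguments. A minimal illustration of that pattern (ktem's actual loader may differ in details):

import importlib

def build_from_spec(spec: dict):
    """Instantiate a class from a {"__type__": "pkg.mod.Class", ...} dict."""
    spec = dict(spec)  # avoid mutating the caller's config
    module_path, _, class_name = spec.pop("__type__").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**spec)  # remaining keys become constructor kwargs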

User Management

User management is enabled by default:

# Enable/disable user management
export KH_FEATURE_USER_MANAGEMENT="true"

# Set admin credentials
export KH_FEATURE_USER_MANAGEMENT_ADMIN="admin"
export KH_FEATURE_USER_MANAGEMENT_PASSWORD="your-secure-password"

Default credentials:

  • Username: admin
  • Password: admin

Security Note: Change the default admin password before deploying to production!


Usage

Web Interface

After starting the application, open http://localhost:7860 in your browser.

Uploading Documents

  1. Navigate to the Files tab

  2. Click Upload or drag and drop files

  3. Supported formats:

    • PDF documents
    • Microsoft Word (.docx)
    • Excel spreadsheets (.xlsx)
    • Plain text files
    • Images (with OCR in full version)
    • Web URLs
    • ZIP archives
  4. Wait for indexing to complete (status shown in UI)

Chat Features

  1. Ask Questions: Type your question in the chat box
  2. View Citations: Click on references to see source documents
  3. PDF Viewer: View highlighted passages in the built-in PDF viewer
  4. Conversation History: Access previous conversations from the sidebar
  5. Follow-up Questions: Continue the conversation with context preserved

Reasoning Pipelines

Select different reasoning strategies based on your needs:

| Pipeline  | Best For          | Description |
| --------- | ----------------- | ----------- |
| Simple    | Quick answers     | Direct retrieval and response |
| Full QA   | Detailed answers  | Enhanced context processing |
| Decompose | Complex questions | Breaks the question into sub-questions |
| ReACT     | Multi-step tasks  | Reasoning + acting agent |
| ReWOO     | Planning tasks    | Plan-then-execute approach |

Change the reasoning pipeline in the Settings tab.
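
To make the Decompose strategy concrete, here is a minimal sketch of the idea, not the repo's actual implementation; ask_llm and retrieve are hypothetical stand-ins for an LLM call and the document retriever:

from typing import Callable

def decompose_answer(
    question: str,
    ask_llm: Callable[[str], str],   # hypothetical: prompt -> completion
    retrieve: Callable[[str], str],  # hypothetical: query -> retrieved context
) -> str:
    # 1. Split the complex question into simpler sub-questions.
    raw = ask_llm(f"Split into sub-questions, one per line:\n{question}")
    subs = [line for line in raw.splitlines() if line.strip()]
    # 2. Answer each sub-question against its own retrieved context.
    partials = [ask_llm(f"Context: {retrieve(q)}\nQuestion: {q}") for q in subs]
    # 3. Synthesize a single final answer from the partial answers.
    return ask_llm(f"Combine into one answer to '{question}':\n" + "\n".join(partials))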


Advanced Configuration

GraphRAG Integration

GraphRAG provides knowledge graph-based retrieval for better understanding of entity relationships.

  1. Enable GraphRAG in flowsettings.py:
KH_INDICES = [
    {
        "name": "File Index",
        "config": {...},
        "default": True,
    },
    {
        "name": "GraphRAG Index",
        "config": {
            "__type__": "ktem.index.file.graph.GraphRAGIndex",
            "doc_store": KH_DOCSTORE,
        },
    },
]
  2. Configure GraphRAG settings in settings.yaml:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat
  model: gpt-4

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-ada-002
  3. Set the environment variables:
export GRAPHRAG_API_KEY="your-api-key"
export USE_CUSTOMIZED_GRAPHRAG_SETTING="true"
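
Before kicking off an indexing run, it is worth a quick sanity check that settings.yaml parses and the key is actually set; a sketch assuming PyYAML is installed:

import os

import yaml  # pip install pyyaml

with open("settings.yaml") as f:
    cfg = yaml.safe_load(f)

# The ${GRAPHRAG_API_KEY} placeholder is only substituted at runtime,
# so the config is useless unless the variable is set in the environment.
assert os.environ.get("GRAPHRAG_API_KEY"), "GRAPHRAG_API_KEY is not set"
print("LLM type:", cfg["llm"]["type"])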

Local LLM Support

Run RUB RAG completely offline with local models:

Using Ollama

  1. Install Ollama: https://ollama.ai

  2. Pull a model:

ollama pull llama3.2
ollama pull nomic-embed-text
  3. Configure in flowsettings.py (already included):
KH_LLMS = {
    "ollama": {
        "spec": {
            "__type__": "kotaemon.llms.ChatOpenAI",
            "base_url": "http://localhost:11434/v1",
            "model": "llama3.2",
            "api_key": "ollama",
        },
    },
}
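
Because the entry above talks to Ollama through its OpenAI-compatible endpoint, you can verify connectivity with the openai client before restarting the app; a minimal sketch, assuming llama3.2 has been pulled:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(reply.choices[0].message.content)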

Using llama.cpp

For GGUF model files:

# Set the model path
export LOCAL_MODEL="/path/to/model.gguf"

# Start the server
bash scripts/server_llamacpp_linux.sh  # or server_llamacpp_macos.sh

Custom Pipelines

Create custom reasoning pipelines by extending BaseComponent:

# In your custom module
from kotaemon.base import BaseComponent

class MyCustomPipeline(BaseComponent):
    """Custom reasoning pipeline"""

    def run(self, question: str, history: list, **kwargs):
        # Your custom logic here
        return {"answer": "...", "citations": [...]}

Register in flowsettings.py:

KH_REASONINGS = [
    "ktem.reasoning.simple.FullQAPipeline",
    "your_module.MyCustomPipeline",  # Add your pipeline
]
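
Before wiring it into the UI, the pipeline can be smoke-tested directly; a hypothetical check, assuming the class needs no constructor arguments:

pipeline = MyCustomPipeline()
result = pipeline.run("What does the corpus say about X?", history=[])
print(result["answer"])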

See docs/pages/app/customize-flows.md for detailed documentation.


Architecture

rubrag/
├── app.py                 # Application entry point
├── flowsettings.py        # Central configuration
├── settings.yaml          # GraphRAG configuration
├── docker-compose.yml     # Docker orchestration
├── Dockerfile             # Multi-stage Docker build
│
├── libs/
│   ├── kotaemon/          # Core RAG library
│   │   ├── agents/        # ReACT, ReWOO agents
│   │   ├── embeddings/    # Embedding model interfaces
│   │   ├── llms/          # LLM provider integrations
│   │   ├── loaders/       # Document loaders
│   │   ├── indices/       # Indexing and retrieval
│   │   └── storages/      # Vector and document stores
│   │
│   └── ktem/              # Application framework
│       ├── pages/         # UI pages (chat, settings, login)
│       ├── reasoning/     # Reasoning pipelines
│       ├── index/         # Indexing system
│       ├── db/            # Database models
│       └── assets/        # CSS, JS, images
│
├── scripts/               # Installation and utility scripts
├── templates/             # Customization templates
└── docs/                  # Documentation

Key Components

| Component       | Purpose |
| --------------- | ------- |
| kotaemon        | Core RAG framework with LLM, embedding, and storage abstractions |
| ktem            | Application layer with the Gradio UI, user management, and pipelines |
| flowsettings.py | Configuration hub for all components |

Data Storage

| Data Type       | Location                             | Purpose |
| --------------- | ------------------------------------ | ------- |
| SQLite database | ktem_app_data/user_data/sql.db       | Users, conversations, file metadata |
| Vector store    | ktem_app_data/user_data/vectorstore/ | Document embeddings (ChromaDB) |
| Document store  | ktem_app_data/user_data/docstore/    | Document content (LanceDB) |
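
To peek at what the application has stored, the SQLite database can be inspected with Python's standard library; a sketch assuming the default ktem_app_data location:

import sqlite3

con = sqlite3.connect("ktem_app_data/user_data/sql.db")
# List the tables that hold users, conversations, and file metadata
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])
con.close()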

Troubleshooting

Common Issues

"No module named 'kotaemon'"

# Ensure you're in the virtual environment
source venv/bin/activate

# Reinstall the libraries
pip install -e libs/kotaemon
pip install -e libs/ktem

"OPENAI_API_KEY not set"

# Set the environment variable
export OPENAI_API_KEY="sk-..."

# Or create a .env file
echo "OPENAI_API_KEY=sk-..." > .env

Docker container won't start

# Check logs
docker-compose logs

# Ensure port 7860 is available
lsof -i :7860

# Restart with fresh state
docker-compose down -v
docker-compose up --build

PDF viewer not working

# Download PDF.js manually
bash scripts/download_pdfjs.sh libs/ktem/ktem/assets/prebuilt

Out of memory errors

  • Reduce chunk_size in settings
  • Use a smaller embedding model
  • Enable document pagination
  • Increase Docker memory limit

Getting Help

If you run into a problem that isn't covered here, please open an issue on the GitHub repository.

Development

Setting Up Development Environment

# Clone and enter the repository
git clone https://github.com/ArneJanning/rubrag.git
cd rubrag

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e libs/kotaemon
pip install -e libs/ktem

# Install development dependencies
pip install black isort flake8 mypy pytest

# Set up pre-commit hooks
pip install pre-commit
pre-commit install

Code Quality

The project uses several tools for code quality:

# Format code
black .
isort .

# Lint
flake8 .

# Type checking
mypy libs/kotaemon libs/ktem

# Run tests
pytest libs/ktem/ktem_tests
pytest libs/kotaemon/tests

Commit Standards

Commits follow the Conventional Commits specification:

feat: add new reasoning pipeline
fix: resolve PDF rendering issue
docs: update installation guide
refactor: simplify embedding interface

Valid types: build, chore, ci, docs, feat, fix, perf, refactor, revert, style, test
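
A commit-msg hook enforcing this format could look like the following sketch; the repo's actual pre-commit configuration may differ, and the regex simply encodes the types listed above:

# check_commit_msg.py -- hypothetical hook script
import re
import sys

TYPES = "build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test"
PATTERN = re.compile(rf"^({TYPES})(\([\w.-]+\))?!?: .+")

with open(sys.argv[1]) as f:  # pre-commit passes the message file path
    first_line = f.readline().strip()

if not PATTERN.match(first_line):
    sys.exit(f"Not a Conventional Commit message: {first_line!r}")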


License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

  • @trducng
  • @lone17
  • @taprosoft
  • @cin-albert

Acknowledgments

Funded within the DFG Collaborative Research Centre (SFB) 1567 "Virtuelle Lebenswelten", subproject A03 "Virtuelle Environments", at Ruhr-Universität Bochum.

Built with the kotaemon RAG framework and a Gradio-based web interface.
