A multi-agent Retrieval-Augmented Generation (RAG) system built with modern architecture patterns. The system processes complex queries through specialized AI agents, featuring real-time streaming responses, hybrid AI provider support, persistent memory management, and advanced tool orchestration.
Built with enterprise-grade reliability using clean architecture principles, the system separates business logic from presentation layers, enabling excellent testability and maintainability.
- 🤖 Multi-Agent Orchestration: Specialized agents handle query decomposition, information gathering, extraction, and response generation
- ☁️ Hybrid AI Provider Support: Seamlessly switch between local (Ollama), cloud (Groq/OpenAI), and Anthropic Claude providers
- 🧠 Persistent Memory System: Cross-session memory using FAISS vector storage and OpenAI embeddings
- ⚙️ Enterprise Configuration: Centralized TOML-based configuration with runtime reloading and environment management
- ⚡ Real-time Streaming: Asynchronous streaming responses with "thinking" process visualization and non-blocking execution
- 🔍 Advanced Search Integration: Semantic search, full-text search, and metadata filtering through Elasticsearch
- 🛠️ Intelligent Tool System: Automatic tool calling with Google-style docstring parsing and retry mechanisms
- 💬 Modern Chat Interface: Streamlit-powered interface with structured reasoning visualization and responsive design
- 🏗️ Clean Architecture: Business logic separated from UI for excellent testability and maintainability
py-jrag implements a clean, modular architecture with the RAGOrchestrator as the core business logic component. The system processes complex queries through multiple specialized agents with support for both streaming and non-streaming responses.
- RAGOrchestrator (`app/orchestrator.py`) - Core business logic orchestrating the entire RAG workflow
- Specialized Agents (`agents/`) - Domain-specific AI agents for different processing tasks
- Memory System (`memory/`) - Persistent conversation memory with FAISS vector storage
- AI Clients (`configuration/`) - Multi-provider AI integration (OpenAI, Groq, Ollama, Anthropic Claude)
- Configuration System (`configuration/configuration.py`) - Enterprise-grade TOML-based configuration management
```mermaid
flowchart TD
    A[👤 User Query] --> B[🎭 RAGOrchestrator]
    B --> C[🔍 Decoupler Agent]
    C --> D[❓ Sub-Questions]
    D --> E[🛠️ Crafter Agent]
    E --> F1[🔍 Semantic Search]
    E --> F2[📄 Full-Text Search]
    E --> F3[🏷️ Metadata Search]
    F1 --> G[📊 Elasticsearch]
    F2 --> G
    F3 --> G
    G --> H[📝 Extractor Agent]
    H --> I[🗣️ Speaker Agent]
    I --> J[💬 Streaming Response]
    K[🌿 Linden Framework] -.-> C
    K -.-> E
    K -.-> H
    K -.-> I
    L[🤖 AI Providers] -.-> K
    M[💾 FAISS Memory] -.-> K
    style A fill:#e1f5fe
    style J fill:#e8f5e8
    style K fill:#fff3e0
    style L fill:#fce4ec
    style M fill:#f3e5f5
```
Flow: User Query → Decoupler → Crafter → Extractor → Speaker → Response
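The flow above can be sketched as a plain function. The four "agents" below are hypothetical stand-ins for the real LLM-backed, asynchronous ones; only the pipeline shape is taken from the source.

```python
# Minimal sketch of the Decoupler → Crafter → Extractor → Speaker flow.
# All four stages are illustrative stand-ins for the LLM-backed agents.

def decouple(query: str) -> list[str]:
    # Decoupler: split a compound query into independent sub-questions.
    return [q.strip() + "?" for q in query.rstrip("?").split(" and ")]

def craft(sub_question: str) -> list[str]:
    # Crafter: gather documents (the real agent calls the Elasticsearch tools).
    return [f"doc for: {sub_question}"]

def extract(docs: list[str]) -> str:
    # Extractor: condense the retrieved context.
    return " | ".join(docs)

def speak(query: str, context: str) -> str:
    # Speaker: generate the final answer from the condensed context.
    return f"Answer to '{query}' based on [{context}]"

def run_pipeline(query: str) -> str:
    contexts = [extract(craft(sq)) for sq in decouple(query)]
    return speak(query, " ; ".join(contexts))
```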
The RAGOrchestrator is the heart of the system, providing clean separation between business logic and UI.
Key Methods:
- `process_query()`: Complete RAG processing pipeline that returns a final response
- `process_query_streaming()`: RAG processing with real-time token streaming for interactive UIs
- `reset_crafter()`: Clears crafter agent state between queries to prevent context contamination
Benefits of the Architecture:
- ✅ Testable: Business logic can be unit tested independently from UI components
- ✅ Maintainable: Clear separation of concerns with modular agent system
- ✅ Reusable: Orchestrator works with any interface (Streamlit, API, CLI)
- ✅ Type-Safe: Full type annotations and Pydantic models for reliable data handling
Built on Linden Framework: py-jrag leverages the Linden framework's AgentRunner infrastructure, extending it with specialized agents and domain-specific tools.
py-jrag uses Linden's core capabilities:
- AgentRunner: Base class providing agent lifecycle management, streaming, and tool integration
- Provider Interface: Unified interface for multiple AI providers (Claude, OpenAI, Groq, Ollama)
- Configuration System: TOML-based configuration with type safety and validation
- Memory Management: Integrated FAISS-based persistent memory across sessions
Each agent extends Linden's AgentRunner with domain-specific functionality:
Decoupler Agent
Purpose: Intelligent query decomposition with conversational context awareness
Key Capabilities:
- Context Resolution: Resolves pronouns and vague references using conversation history
- Smart Decomposition: Breaks complex questions into independent, actionable sub-questions
- Session Awareness: Handles session-specific queries and comparisons intelligently
- Minimal Splitting: Uses the minimum number of sub-questions required
- Data Compliance: Includes relevant metadata keywords (session_id, product_id, user_id, start_time)
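A hypothetical sketch of what the Decoupler's output might look like; the `SubQuestion` class and field names are illustrative, not the project's actual data model, but the metadata keywords come from the list above.

```python
from dataclasses import dataclass, field

# Hypothetical shape of the Decoupler's output: each sub-question carries the
# metadata keywords (session_id, product_id, user_id, start_time) relevant
# for downstream filtering. The class and field names are illustrative.
@dataclass
class SubQuestion:
    text: str
    metadata_keywords: list[str] = field(default_factory=list)

# "What did user U-42 buy, and when did that session start?" might decompose to:
sub_questions = [
    SubQuestion("Which products did user U-42 purchase?",
                ["user_id", "product_id"]),
    SubQuestion("What is the start_time of the purchase session?",
                ["session_id", "start_time"]),
]
```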
Crafter Agent
Purpose: Intelligent tool orchestration for comprehensive information retrieval
Search Tools:
- Semantic Search
  - Semantic Understanding: Uses advanced embeddings for concept-based search
  - Cosine Similarity: Advanced similarity matching for contextual relevance
  - Natural Language: Supports complex natural language queries
  - Configurable Results: Flexible result count with performance optimization
- Full Text Search
  - Exact Matching: Precise keyword and phrase matching
  - Elasticsearch Integration: Leverages full-text search capabilities
  - High Precision: Ideal for specific term searches
  - Performance Optimized: Fast retrieval with indexed search
- Metadata Search
  - Structured Filtering: Multi-field metadata filtering with boolean logic
  - Schema Validation: Strict input validation for supported metadata fields
  - Flexible Queries: Supports partial and complete metadata combinations
  - Session Tracking: Specialized for session-based data analysis
Extractor Agent
Purpose: Context processing and information synthesis for long-form content
Key Features:
- Content Summarization: Processes large context blocks when information exceeds manageable size
- Intelligent Filtering: Automatically triggered when context contains more than 2 document segments
- Structured Output: Returns processed information with summary field
- Context Awareness: Maintains question relevance while condensing information
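The trigger rule above (summarize only when context exceeds two document segments) can be sketched as follows; the summarizer is a hypothetical stand-in for the LLM call, and the return shape only mirrors the "summary field" described above.

```python
# Sketch of the Extractor's trigger rule: summarization kicks in only when
# the retrieved context exceeds two document segments. summarize() is a
# placeholder for the real LLM-backed condensation step.

def summarize(segments: list[str]) -> str:
    # Placeholder: the real Extractor delegates this to an LLM.
    return f"{len(segments)} segments condensed"

def process_context(segments: list[str], max_segments: int = 2) -> dict:
    if len(segments) <= max_segments:
        # Small contexts pass through untouched.
        return {"summary": None, "segments": segments}
    # Larger contexts are condensed into a single summary field.
    return {"summary": summarize(segments), "segments": segments}
```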
Speaker Agent
Purpose: Final response generation with streaming support and advanced reasoning
Advanced Capabilities:
- Real-time Streaming: Token-level streaming for immediate user feedback with proper error handling
- Thinking Process: Special `<think>` tags for reasoning visualization in collapsible UI steps
- High-Quality Models: Uses configurable models (Claude Sonnet 4 by default) for superior response quality
- Context Integration: Synthesizes information from all previous agents with conversation awareness
- Memory-Aware Responses: Maintains chat history for contextual and personalized responses
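A minimal sketch of how a UI might separate the Speaker's `<think>` reasoning from the visible answer for collapsible display; the function name is illustrative, not the project's API.

```python
import re

# Illustrative helper: split a <think>-tagged response into the reasoning
# (shown in a collapsible step) and the visible answer.

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a <think>-tagged response."""
    reasoning = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL))
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

thoughts, answer = split_thinking("<think>check the index</think>42 results.")
```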
Powered by Linden's Multi-Provider System: py-jrag leverages Linden's unified provider interface:
Anthropic Claude
- Latest Models: Claude Sonnet 4 support with advanced reasoning capabilities
- Streaming Support: Real-time token streaming with proper error handling
- Tool Integration: Native function calling with structured JSON responses
- Production Ready: Enterprise-grade error handling and rate limiting
Ollama (Local)
- Privacy-First Local Execution: Complete data privacy with local model hosting
- Real-time Streaming: Token-level streaming via chunked HTTP responses
- Resource Optimization: Efficient local inference with customizable parameters
Groq
- Ultra-Fast Cloud Inference: Specialized infrastructure for rapid responses
- Production-Grade Streaming: Real-time streaming with chunked delivery and error recovery
- Robust Error Handling: Comprehensive error recovery and automatic retry mechanisms
OpenAI
- GPT Integration: Full support for GPT-4 and GPT-3.5-turbo models
- Advanced Embedding Support: Using text-embedding-3-small for the memory system
- Enterprise Configuration: Comprehensive API key management and billing control
The system uses a centralized TOML-based configuration with the ConfigManager class:
```toml
[models]
dec = "claude-sonnet-4-20250514"       # Decoupler agent model
tool = "claude-sonnet-4-20250514"      # Crafter agent model
extractor = "claude-sonnet-4-20250514" # Extractor agent model
speaker = "claude-sonnet-4-20250514"   # Speaker agent model

[groq]
base_url = "https://api.groq.com/"
api_key = ""  # Required for Groq provider
timeout = 120

[ollama]
timeout = 120  # Local model timeout

[openai]
api_key = ""  # Required for embeddings and OpenAI models
timeout = 120

[anthropic]
api_key = ""  # Required for Claude models
timeout = 120
max_tokens = 2048

[elasticsearch]
scheme = "https"
host = "localhost"
port = 9200
auth_name = "elastic"
auth_pwd = "changeme"
index_name = "webflow"

[memory]
path = "./memory/faiss/faiss_memories"  # FAISS vector storage path
collection_name = "py-jrag"
```

- Singleton Pattern: Single configuration instance across the application
- Runtime Reloading: `ConfigManager.reload()` for configuration updates
- Environment Variables: Support for overriding configuration via environment variables
- Type Safety: Strongly typed configuration with validation
- Default Values: Sensible defaults for non-critical settings
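The environment-variable override could work along these lines; the `PYJRAG_` prefix and the `SECTION_KEY` naming scheme are assumptions, not the actual `ConfigManager` convention.

```python
import os

# Sketch of environment-variable overrides for a TOML-derived config dict.
# The PYJRAG_ prefix and SECTION_KEY naming are assumed, not the project's
# actual convention.

def apply_env_overrides(conf: dict, prefix: str = "PYJRAG") -> dict:
    overridden = {section: dict(values) for section, values in conf.items()}
    for section, values in overridden.items():
        for key in values:
            env_name = f"{prefix}_{section}_{key}".upper()
            if env_name in os.environ:
                values[key] = os.environ[env_name]
    return overridden

os.environ["PYJRAG_OPENAI_API_KEY"] = "sk-demo"
conf = apply_env_overrides({"openai": {"api_key": "", "timeout": 120}})
```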
Built on Linden's Memory Infrastructure: py-jrag uses Linden's integrated memory management:
Key Components:
- Vector Storage: FAISS-based vector database for semantic memory retrieval
- Embeddings: OpenAI text-embedding-3-small for high-quality semantic understanding
- Agent Isolation: Per-agent memory spaces prevent cross-contamination
- Conversation Retrieval: Retrieves relevant past interactions
- Automatic Recording: Stores interactions with context inference
- Memory Reset: Clean slate functionality per agent
Memory Configuration:
```python
config = {
    "embedder": {
        "provider": "openai",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": conf.openai.api_key
        }
    },
    "vector_store": {
        "provider": "faiss",
        "config": {
            "collection_name": "py-jrag",
            "path": conf.memory.path
        }
    }
}
```

The system underwent comprehensive evaluation using a golden dataset of 34 queries across 4 distinct categories:
| Metric Category | Average Score | Performance Level |
|---|---|---|
| Context Precision | 0.8824 | Excellent (88.24% of retrieved docs relevant) |
| Context Recall | 0.6690 | Good (66.90% of relevant info retrieved) |
| Human Faithfulness | 4.9706 | Outstanding (Nearly perfect fidelity) |
| LLM Faithfulness | 4.7353 | Excellent (High automated agreement) |
| Human Relevancy | 4.4118 | Very Good (Highly relevant responses) |
| LLM Relevancy | 4.6765 | Excellent (Strong automated relevance) |
| Human Completeness | 4.2059 | Good (Comprehensive information coverage) |
| LLM Completeness | 4.0000 | Good (Adequate automated completeness) |
| Human Clarity | 4.5882 | Excellent (Very clear communication) |
| LLM Clarity | 4.9118 | Outstanding (Exceptional automated clarity) |
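The two retrieval metrics in the table follow the standard definitions, sketched below: context precision is the fraction of retrieved documents that are relevant, and context recall is the fraction of relevant documents that were retrieved.

```python
# Standard definitions behind the retrieval metrics above. Documents are
# represented by identifiers for illustration.

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    return sum(doc in relevant for doc in set(retrieved)) / len(relevant)
```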
| Category | Precision | Recall | Human Avg | LLM Avg | Performance Notes |
|---|---|---|---|---|---|
| Product Queries | 1.000 | 0.958 | 4.42 | 4.54 | 🏆 Best Overall: Perfect precision, excellent recall |
| User Queries | 1.000 | 0.667 | 4.58 | 4.78 | 🎯 High Precision: Perfect document relevance |
| Generic Queries | 0.833 | 0.542 | 4.74 | 4.59 | 📚 Balanced: Good general performance |
| Session Queries | 0.714 | 0.643 | 4.35 | 4.25 | ⚡ Challenging: Complex temporal queries |
- Python 3.10+
- Elasticsearch cluster (local or cloud)
- API keys for chosen providers:
- Anthropic API key (recommended for production)
- OpenAI API key (required for embeddings)
- Groq API key (optional, for high-speed inference)
- Ollama (optional, for local inference)
1. Clone Repository

   ```shell
   git clone <repository-url>
   cd py-jrag
   ```

2. Install Dependencies

   ```shell
   pip install -r requirements.txt
   ```

3. Configure Application
   - Copy `config.toml` and update it with your API keys and settings
   - Configure Elasticsearch connection details
   - Set the memory storage path

4. Setup Elasticsearch
   - Start the Elasticsearch cluster
   - Create the index with appropriate mappings
   - Configure authentication credentials

5. Setup Local Models (Optional)
   - Install Ollama if using local inference
   - Pull desired models (e.g., `ollama pull llama2`)

6. Run Application

   ```shell
   streamlit run streamlit_app.py
   ```
1. Start Streamlit Server

   ```shell
   streamlit run streamlit_app.py
   ```

2. Access Web Interface
   - Open a browser to `http://localhost:8501`
   - Start chatting with the py-jrag system

3. Monitor Logs
   - Check the console for detailed agent execution logs
   - Debug mode provides comprehensive tracking and performance metrics
The RAGOrchestrator can be imported and used directly in Python code:
```python
from app.orchestrator import RAGOrchestrator
from configuration.configuration import ConfigManager

# Initialize configuration
ConfigManager.initialize(config_path="config.toml")

# Create orchestrator
orchestrator = RAGOrchestrator()

# Process query
response = await orchestrator.process_query("Your question here")
print(response.content)

# Or use streaming
async for chunk in orchestrator.process_query_streaming("Your question here"):
    if chunk.content:
        print(chunk.content, end="")
```

```shell
# Run unit tests
python -m pytest test/app/test_orchestrator.py -v

# Run integration tests (requires proper setup)
python -m pytest test/ -m integration

# Run evaluation suite
python evaluation/1_rag_evaluation_runner.py  # Generate responses
python evaluation/2_llm_as_judge.py           # LLM evaluation
python evaluation/3_metrics.py                # Calculate metrics

# Validate architecture setup
python validate_architecture.py
```

```
py-jrag/
├── app/                    # Core business logic
│   ├── orchestrator.py     # Main RAG orchestrator
│   └── models.py           # Data structures
├── agents/                 # Specialized AI agents
│   ├── decoupler.py        # Query decomposition
│   ├── crafter.py          # Information gathering
│   ├── extractor.py        # Context processing
│   └── speaker.py          # Response generation
├── configuration/          # Configuration management
│   └── configuration.py    # Config system
├── elastic/                # Elasticsearch integration
│   └── elastic.py          # Search client
├── memory/                 # Persistent memory storage
├── evaluation/             # Testing and evaluation
├── test/                   # Unit and integration tests
├── assets/                 # Static assets (logo, etc.)
├── streamlit_app.py        # Web interface
├── config.toml             # Configuration file
└── requirements.txt        # Dependencies
```

