A Benchmark for Evaluating AI Models on Document Analysis, Forensic Detection, and Inference Under Redaction
EpsteinBench is a comprehensive benchmark for evaluating large language models on complex document analysis tasks using publicly released court documents. The benchmark tests three core capabilities:
- Retrieval: Question answering from a corpus of 20,000+ documents
- Inference: Predicting content hidden by redactions
- Forensics: Detecting documents with faulty redactions
Beyond the core tasks, the repository includes:
- Modern Dashboard: Dark-mode React dashboard with real-time leaderboard, model comparison, and task browser
- Control Hub: Manage data downloads, task generation, and benchmark runs from a single interface
- OpenRouter Integration: Evaluate any model available via the OpenRouter API (a minimal request sketch follows this list)
- LLM-as-a-Judge: Semantic evaluation using a judge model for nuanced scoring beyond exact match
- Comprehensive Metrics: Tier-based retrieval scoring, calibration metrics, and forensic detection rates
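For reference, this is roughly what a raw request against the OpenRouter chat-completions API looks like. The benchmark ships its own client under `epsteinbench/models/`; the model name and prompt below are placeholders, not part of the benchmark.

```python
# Minimal sketch: query a model through OpenRouter's OpenAI-compatible
# chat-completions endpoint. The benchmark's own client lives in
# epsteinbench/models/; model and prompt here are placeholders.
import os
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [
            {"role": "user", "content": "Summarize the attached deposition excerpt."}
        ],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```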
```bash
# Clone the repository
git clone https://github.com/CMLKevin/EpsteinBench.git
cd EpsteinBench

# Install dependencies
pip install -e .

# Download spaCy model (required for task generation)
python -m spacy download en_core_web_sm

# Install dashboard dependencies
cd dashboard && npm install && cd ..
```

```bash
# Download datasets from HuggingFace and GitHub
python scripts/download_datasets.py

# Build ground truth from phelix001 extractions
python scripts/build_ground_truth.py

# Generate benchmark task files
python scripts/generate_tasks.py
```

```bash
# Set your OpenRouter API key
export OPENROUTER_API_KEY=your_key_here

# Evaluate a model
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4

# Evaluate with LLM-as-a-Judge scoring
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --run-judge

# Customize judge model and limit judged tasks
python scripts/run_evaluation.py --model openai/gpt-4o --run-judge --judge-model minimax/minimax-m2.1 --max-judge-tasks 50

# Or estimate cost first
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --dry-run

# Run random baseline (no API key needed)
python scripts/run_evaluation.py --baseline
```

```bash
# Start both API server and frontend
python scripts/start_dashboard.py

# Or start just the API
python -m uvicorn epsteinbench.api.server:app --reload --port 8000
```

Access the dashboard at http://localhost:5173 and API docs at http://localhost:8000/docs.
| Module | Weight | Tasks | Description |
|---|---|---|---|
| Retrieval | 35% | 389 | Question answering across 5 difficulty tiers |
| Inference | 50% | 14,600 | Predicting redacted content with calibration |
| Forensics | 15% | 100 | Classifying document redaction security |
| Tier | Difficulty | Description |
|---|---|---|
| T1 | Easy | Single-document keyword matching |
| T2 | Medium | Single-document semantic reasoning |
| T3 | Hard | Multi-document synthesis |
| T4 | Expert | Temporal chain reasoning |
| T5 | Adversarial | Unanswerable questions (hallucination detection) |
Inference tasks cover five redaction categories:
- Email: Extracted email addresses
- Name: Person names
- Date: Dates and times
- Phone: Phone numbers
- Narrative: Longer text passages
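For orientation, an inference task pairs a redacted passage with the hidden span's category and ground truth. The record below is purely illustrative; the field names are hypothetical, so consult the generated files in `benchmarks/` for the real schema.

```python
# Hypothetical shape of an inference task record. Field names are
# illustrative only; see the generated files in benchmarks/ for the
# actual schema.
task = {
    "task_id": "inference-000123",  # hypothetical ID format
    "category": "Name",             # one of: Email, Name, Date, Phone, Narrative
    "context": "On [REDACTED], counsel for the defendant filed a motion ...",
    "ground_truth": "<hidden span from the unredacted source>",
}

# A model prediction is scored on correctness and, per the metrics above,
# on how well its stated confidence is calibrated.
prediction = {"answer": "...", "confidence": 0.35}
```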
The EpsteinBench score is computed as:
Score = 35% × Retrieval + 50% × Inference + 15% × Forensics
All component scores are normalized to 0-100.
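The composite follows directly from the three module scores; a minimal sketch:

```python
def epsteinbench_score(retrieval: float, inference: float, forensics: float) -> float:
    """Composite score; each input is a module score normalized to 0-100."""
    return 0.35 * retrieval + 0.50 * inference + 0.15 * forensics

# Example: strong retrieval, middling inference, weak forensics.
print(epsteinbench_score(80.0, 55.0, 40.0))  # 61.5
```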
Beyond exact-match metrics, EpsteinBench uses an LLM judge to provide semantic evaluation:
| Module | Judge Criteria | Scale |
|---|---|---|
| Retrieval | Correctness, Completeness, Source Quality, Hallucination | 0-5 each |
| Inference | Correctness, Reasoning Quality, Calibration | 0-5 each |
| Forensics | Detection Accuracy, Reasoning Quality, Calibration | 0-5 each |
Judge scores are normalized to 0-100 and combined with automatic metrics for a comprehensive evaluation.
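The mapping from 0-5 criteria to the 0-100 scale is simple arithmetic. The sketch below assumes equal weighting of the criteria, which is an assumption rather than the benchmark's documented formula; the actual combination logic lives in `epsteinbench/evaluation/`.

```python
def normalize_judge_scores(criteria: dict[str, int]) -> float:
    """Average 0-5 judge criteria and rescale to 0-100.

    Equal weighting is an assumption; see epsteinbench/evaluation/ for
    the actual combination logic.
    """
    mean = sum(criteria.values()) / len(criteria)
    return mean / 5 * 100

# Example: a retrieval response judged on the four criteria above.
scores = {"correctness": 4, "completeness": 3, "source_quality": 5, "hallucination": 5}
print(normalize_judge_scores(scores))  # 85.0
```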
Default benchmark models via OpenRouter:
- minimax/minimax-m2.1
- zhipu-ai/glm-4-plus
- x-ai/grok-2-1212
- google/gemini-2.0-flash-exp
Additional models available:
- anthropic/claude-sonnet-4
- anthropic/claude-opus-4
- openai/gpt-4o
- openai/gpt-4o-mini
- meta-llama/llama-3.3-70b-instruct
- mistralai/mistral-large
- deepseek/deepseek-chat
- google/gemini-pro
Judge model (default): minimax/minimax-m2.1
```
EpsteinBench/
├── epsteinbench/          # Main Python package
│   ├── api/               # FastAPI backend
│   ├── benchmark/         # Task definitions and generators
│   │   ├── retrieval/
│   │   ├── inference/
│   │   └── forensics/
│   ├── data/              # Data loaders and indexers
│   ├── evaluation/        # Metrics, scoring, and LLM-as-a-Judge
│   └── models/            # OpenRouter client
├── dashboard/             # React frontend (shadcn/ui)
│   ├── src/
│   │   ├── components/    # UI components
│   │   ├── pages/         # Page components
│   │   └── hooks/         # React hooks
├── scripts/               # CLI scripts
├── benchmarks/            # Generated task files
├── results/               # Evaluation outputs
└── docs/                  # Documentation
```
Environment variables:

```bash
OPENROUTER_API_KEY=your_key_here  # Required for model evaluation
```

Configuration file: `.env` or via `epsteinbench/config.py`
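A minimal sketch of supplying the key from a `.env` file, assuming the standard `python-dotenv` package; the project's own loading logic is in `epsteinbench/config.py`.

```python
# Load OPENROUTER_API_KEY from a .env file using python-dotenv.
# This mirrors what a typical config module does; the project's actual
# loading logic lives in epsteinbench/config.py.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ.get("OPENROUTER_API_KEY")
if not api_key:
    raise RuntimeError("OPENROUTER_API_KEY is not set; model evaluation will fail.")
```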
| Endpoint | Description |
|---|---|
| `GET /api/health` | Health check |
| `GET /api/stats` | Benchmark statistics |
| `GET /api/leaderboard` | Model rankings |
| `GET /api/models` | List evaluated models |
| `GET /api/models/{name}` | Detailed model scores |
| `GET /api/tasks` | Browse benchmark tasks |
| `GET /api/compare` | Compare multiple models |
| `POST /api/control/benchmark/run` | Start evaluation job (supports `run_judge`, `judge_model` options) |
Full API documentation available at /docs when the server is running.
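For scripting against the server, the read endpoints are plain GETs. The JSON body for the run endpoint below is an assumption based on the options listed in the table; check the live schema at `/docs` for the authoritative format.

```python
# Sketch of calling the API from a script. The GET endpoints come from
# the table above; the POST payload shape is an assumption -- check the
# live schema at /docs.
import requests

BASE = "http://localhost:8000"

leaderboard = requests.get(f"{BASE}/api/leaderboard", timeout=30).json()

job = requests.post(
    f"{BASE}/api/control/benchmark/run",
    json={
        "model": "anthropic/claude-sonnet-4",  # payload fields are assumed
        "run_judge": True,
        "judge_model": "minimax/minimax-m2.1",
    },
    timeout=30,
).json()
print(job)
```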
The dashboard features a "Forensic Noir" design theme:
- Leaderboard: Real-time model rankings with score breakdowns
- Control Hub: Manage data, tasks, models, and run benchmarks
- Compare: Side-by-side model comparison with charts
- Tasks: Browse benchmark tasks and model predictions
- About: Methodology and ethical guidelines
Built with React, TypeScript, Tailwind CSS, and shadcn/ui components.
This benchmark is built on publicly released court documents. We have taken care to:
- Exclude CSAM: No content related to child sexual abuse material
- Protect Victims: No identifying information about minor victims
- Research Focus: Designed for academic evaluation of AI capabilities
See docs/ETHICS.md for detailed guidelines.
```bibtex
@misc{epsteinbench2025,
  title={EpsteinBench: A Benchmark for AI Document Analysis Under Redaction},
  year={2025},
  url={https://github.com/CMLKevin/EpsteinBench}
}
```

MIT License - see LICENSE for details.
- Data sources: tensonaut/EPSTEIN_FILES_20K, phelix001/epstein-network
- Built with: FastAPI, React, OpenRouter, HuggingFace Datasets, shadcn/ui