EpsteinBench

A Benchmark for Evaluating AI Models on Document Analysis, Forensic Detection, and Inference Under Redaction

Python 3.10+ License: MIT

EpsteinBench is a comprehensive benchmark for evaluating large language models on complex document analysis tasks using publicly released court documents. The benchmark tests three core capabilities:

  • Retrieval: Question answering from a corpus of 20,000+ documents
  • Inference: Predicting content hidden by redactions
  • Forensics: Detecting documents with faulty redactions

Features

  • Modern Dashboard: Dark-mode React dashboard with real-time leaderboard, model comparison, and task browser
  • Control Hub: Manage data downloads, task generation, and benchmark runs from a single interface
  • OpenRouter Integration: Evaluate any model available via OpenRouter API
  • LLM-as-a-Judge: Semantic evaluation using a judge model for nuanced scoring beyond exact match
  • Comprehensive Metrics: Tier-based retrieval scoring, calibration metrics, and forensic detection rates

Quick Start

Installation

# Clone the repository
git clone https://github.com/CMLKevin/EpsteinBench.git
cd EpsteinBench

# Install dependencies
pip install -e .

# Download spaCy model (required for task generation)
python -m spacy download en_core_web_sm

# Install dashboard dependencies
cd dashboard && npm install && cd ..

Download Data

# Download datasets from HuggingFace and GitHub
python scripts/download_datasets.py

Generate Benchmark Tasks

# Build ground truth from phelix001 extractions
python scripts/build_ground_truth.py

# Generate benchmark task files
python scripts/generate_tasks.py

Run Evaluation

# Set your OpenRouter API key
export OPENROUTER_API_KEY=your_key_here

# Evaluate a model
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4

# Evaluate with LLM-as-a-Judge scoring
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --run-judge

# Customize judge model and limit judged tasks
python scripts/run_evaluation.py --model openai/gpt-4o --run-judge --judge-model minimax/minimax-m2.1 --max-judge-tasks 50

# Or estimate cost first
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --dry-run

# Run random baseline (no API key needed)
python scripts/run_evaluation.py --baseline

Launch Dashboard

# Start both API server and frontend
python scripts/start_dashboard.py

# Or start just the API
python -m uvicorn epsteinbench.api.server:app --reload --port 8000

Access the dashboard at http://localhost:5173 and API docs at http://localhost:8000/docs.

Benchmark Overview

Modules

Module      Weight   Tasks    Description
Retrieval   35%      389      Question answering across 5 difficulty tiers
Inference   50%      14,600   Predicting redacted content with calibration
Forensics   15%      100      Classifying document redaction security

Retrieval Tiers

Tier   Difficulty    Description
T1     Easy          Single-document keyword matching
T2     Medium        Single-document semantic reasoning
T3     Hard          Multi-document synthesis
T4     Expert        Temporal chain reasoning
T5     Adversarial   Unanswerable questions (hallucination detection)

Inference Content Types

  • Email: Extracted email addresses
  • Name: Person names
  • Date: Dates and times
  • Phone: Phone numbers
  • Narrative: Longer text passages
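
To make the task shape concrete, here is a hypothetical inference task record and model response. The field names are illustrative assumptions only, not the repository's actual schema; see the generated files under benchmarks/ for the real format.

# Hypothetical inference task record -- field names are illustrative
# assumptions, not the repository's actual schema.
example_task = {
    "task_id": "inference-000123",
    "content_type": "date",            # one of: email, name, date, phone, narrative
    "context": "On [REDACTED], counsel for the plaintiff contacted ...",
    "redaction_span": "[REDACTED]",
    "ground_truth": "<hidden value used only for scoring>",
}

# A model response supplies a prediction plus a self-reported confidence,
# which the calibration metrics compare against actual accuracy.
example_response = {
    "prediction": "June 21, 2016",
    "confidence": 0.4,                  # 0.0-1.0
}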

Scoring

The EpsteinBench score is computed as:

Score = 35% × Retrieval + 50% × Inference + 15% × Forensics

All component scores are normalized to 0-100.
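
As a quick sanity check on the weighting, here is a minimal sketch; the function name is illustrative, and the actual scoring code lives in epsteinbench/evaluation.

def epsteinbench_score(retrieval: float, inference: float, forensics: float) -> float:
    """Weighted combination of per-module scores, each already on a 0-100 scale."""
    return 0.35 * retrieval + 0.50 * inference + 0.15 * forensics

# Example: strong retrieval, middling inference, weak forensics
print(round(epsteinbench_score(80.0, 55.0, 30.0), 2))  # 60.0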

LLM-as-a-Judge Evaluation

Beyond exact-match metrics, EpsteinBench uses an LLM judge to provide semantic evaluation:

Module      Judge Criteria                                              Scale
Retrieval   Correctness, Completeness, Source Quality, Hallucination    0-5 each
Inference   Correctness, Reasoning Quality, Calibration                 0-5 each
Forensics   Detection Accuracy, Reasoning Quality, Calibration          0-5 each

Judge scores are normalized to 0-100 and combined with automatic metrics for a comprehensive evaluation.
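
A minimal sketch of that normalization, assuming a simple mean over the per-criterion 0-5 scores; the actual judge aggregation is implemented in epsteinbench/evaluation and may weight criteria differently.

def normalize_judge_scores(criteria_scores: dict[str, float]) -> float:
    """Map per-criterion 0-5 judge scores to a single 0-100 value (simple mean)."""
    mean_score = sum(criteria_scores.values()) / len(criteria_scores)
    return mean_score * 20.0  # 0-5 scale -> 0-100 scale

print(normalize_judge_scores({
    "correctness": 4, "completeness": 3, "source_quality": 5, "hallucination": 4,
}))  # 80.0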

Supported Models

Default benchmark models via OpenRouter:

  • minimax/minimax-m2.1
  • zhipu-ai/glm-4-plus
  • x-ai/grok-2-1212
  • google/gemini-2.0-flash-exp

Additional models available:

  • anthropic/claude-sonnet-4
  • anthropic/claude-opus-4
  • openai/gpt-4o
  • openai/gpt-4o-mini
  • meta-llama/llama-3.3-70b-instruct
  • mistralai/mistral-large
  • deepseek/deepseek-chat
  • google/gemini-pro

Judge model (default): minimax/minimax-m2.1
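
The repository wraps OpenRouter in epsteinbench/models; for reference, a minimal direct call against OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like this, using a model slug from the list above. This is only an illustration of the upstream API, not the project's client.

import os
import requests

# Minimal OpenRouter request sketch; EpsteinBench ships its own client
# under epsteinbench/models.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [{"role": "user", "content": "Summarize this deposition excerpt: ..."}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])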

Project Structure

EpsteinBench/
├── epsteinbench/           # Main Python package
│   ├── api/                # FastAPI backend
│   ├── benchmark/          # Task definitions and generators
│   │   ├── retrieval/
│   │   ├── inference/
│   │   └── forensics/
│   ├── data/               # Data loaders and indexers
│   ├── evaluation/         # Metrics, scoring, and LLM-as-a-Judge
│   └── models/             # OpenRouter client
├── dashboard/              # React frontend (shadcn/ui)
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Page components
│   │   └── hooks/          # React hooks
├── scripts/                # CLI scripts
├── benchmarks/             # Generated task files
├── results/                # Evaluation outputs
└── docs/                   # Documentation

Configuration

Environment variables:

OPENROUTER_API_KEY=your_key_here  # Required for model evaluation

Configuration can be set in a .env file or in epsteinbench/config.py.
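
A minimal pattern for loading the key in your own scripts, assuming python-dotenv is installed; the project's own loading logic lives in epsteinbench/config.py.

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current directory, if present
api_key = os.environ.get("OPENROUTER_API_KEY")
if not api_key:
    raise RuntimeError("OPENROUTER_API_KEY is not set; model evaluation will fail.")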

API Endpoints

Endpoint                            Description
GET  /api/health                    Health check
GET  /api/stats                     Benchmark statistics
GET  /api/leaderboard               Model rankings
GET  /api/models                    List evaluated models
GET  /api/models/{name}             Detailed model scores
GET  /api/tasks                     Browse benchmark tasks
GET  /api/compare                   Compare multiple models
POST /api/control/benchmark/run     Start evaluation job (supports run_judge, judge_model options)

Full API documentation is available at /docs when the server is running.
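
For scripted access, the endpoints can be called directly against a locally running server. The sketch below assumes the default port from the Quick Start; the request body for the run endpoint is an assumption based on the options listed above, so check /docs for the exact schema.

import requests

BASE = "http://localhost:8000"

# Read the current leaderboard
leaderboard = requests.get(f"{BASE}/api/leaderboard", timeout=30).json()

# Kick off an evaluation run; the body fields are assumptions based on the
# options listed in the table above -- consult /docs for the real schema.
job = requests.post(
    f"{BASE}/api/control/benchmark/run",
    json={
        "model": "anthropic/claude-sonnet-4",
        "run_judge": True,
        "judge_model": "minimax/minimax-m2.1",
    },
    timeout=30,
).json()
print(job)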

Dashboard

The dashboard features a "Forensic Noir" design theme:

  • Leaderboard: Real-time model rankings with score breakdowns
  • Control Hub: Manage data, tasks, models, and run benchmarks
  • Compare: Side-by-side model comparison with charts
  • Tasks: Browse benchmark tasks and model predictions
  • About: Methodology and ethical guidelines

Built with React, TypeScript, Tailwind CSS, and shadcn/ui components.

Ethical Considerations

This benchmark is built on publicly released court documents. We have taken care to:

  • Exclude CSAM: No content related to child sexual abuse material
  • Protect Victims: No identifying information about minor victims
  • Research Focus: Designed for academic evaluation of AI capabilities

See docs/ETHICS.md for detailed guidelines.

Citation

@misc{epsteinbench2025,
  title={EpsteinBench: A Benchmark for AI Document Analysis Under Redaction},
  year={2025},
  url={https://github.com/CMLKevin/EpsteinBench}
}

License

MIT License - see LICENSE for details.

Acknowledgments

  • Data sources: tensonaut/EPSTEIN_FILES_20K, phelix001/epstein-network
  • Built with: FastAPI, React, OpenRouter, HuggingFace Datasets, shadcn/ui
