EpsteinBench

A Benchmark for Evaluating AI Models on Document Analysis, Forensic Detection, and Inference Under Redaction

Python 3.10+ License: MIT

EpsteinBench is a comprehensive benchmark for evaluating large language models on complex document analysis tasks using publicly released court documents. The benchmark tests three core capabilities:

  • Retrieval: Question answering from a corpus of 20,000+ documents
  • Inference: Predicting content hidden by redactions
  • Forensics: Detecting documents with faulty redactions

Features

  • Modern Dashboard: Dark-mode React dashboard with real-time leaderboard, model comparison, and task browser
  • Control Hub: Manage data downloads, task generation, and benchmark runs from a single interface
  • OpenRouter Integration: Evaluate any model available via OpenRouter API
  • LLM-as-a-Judge: Semantic evaluation using a judge model for nuanced scoring beyond exact match
  • Comprehensive Metrics: Tier-based retrieval scoring, calibration metrics, and forensic detection rates

Quick Start

Installation

# Clone the repository
git clone https://github.com/CMLKevin/EpsteinBench.git
cd EpsteinBench

# Install dependencies
pip install -e .

# Download spaCy model (required for task generation)
python -m spacy download en_core_web_sm

# Install dashboard dependencies
cd dashboard && npm install && cd ..

Download Data

# Download datasets from HuggingFace and GitHub
python scripts/download_datasets.py

Generate Benchmark Tasks

# Build ground truth from phelix001 extractions
python scripts/build_ground_truth.py

# Generate benchmark task files
python scripts/generate_tasks.py

Run Evaluation

# Set your OpenRouter API key
export OPENROUTER_API_KEY=your_key_here

# Evaluate a model
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4

# Evaluate with LLM-as-a-Judge scoring
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --run-judge

# Customize judge model and limit judged tasks
python scripts/run_evaluation.py --model openai/gpt-4o --run-judge --judge-model minimax/minimax-m2.1 --max-judge-tasks 50

# Or estimate cost first
python scripts/run_evaluation.py --model anthropic/claude-sonnet-4 --dry-run

# Run random baseline (no API key needed)
python scripts/run_evaluation.py --baseline

Launch Dashboard

# Start both API server and frontend
python scripts/start_dashboard.py

# Or start just the API
python -m uvicorn epsteinbench.api.server:app --reload --port 8000

Access the dashboard at http://localhost:5173 and API docs at http://localhost:8000/docs.

Benchmark Overview

Modules

Module      Weight   Tasks    Description
Retrieval   35%      389      Question answering across 5 difficulty tiers
Inference   50%      14,600   Predicting redacted content with calibration
Forensics   15%      100      Classifying document redaction security

Retrieval Tiers

Tier   Difficulty    Description
T1     Easy          Single-document keyword matching
T2     Medium        Single-document semantic reasoning
T3     Hard          Multi-document synthesis
T4     Expert        Temporal chain reasoning
T5     Adversarial   Unanswerable questions (hallucination detection)

Inference Content Types

  • Email: Extracted email addresses
  • Name: Person names
  • Date: Dates and times
  • Phone: Phone numbers
  • Narrative: Longer text passages
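
To make the task shape concrete, here is a hypothetical inference task record and model response. The field names are illustrative assumptions only, not the repository's actual schema; see the generated files under benchmarks/ for the real format.

# Hypothetical inference task record -- field names are illustrative
# assumptions, not the repository's actual schema.
example_task = {
    "task_id": "inference-000123",
    "content_type": "date",            # one of: email, name, date, phone, narrative
    "context": "On [REDACTED], counsel for the plaintiff contacted ...",
    "redaction_span": "[REDACTED]",
    "ground_truth": "<hidden value used only for scoring>",
}

# A model response supplies a prediction plus a self-reported confidence,
# which the calibration metrics compare against actual accuracy.
example_response = {
    "prediction": "June 21, 2016",
    "confidence": 0.4,                  # 0.0-1.0
}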

Scoring

The EpsteinBench score is computed as:

Score = 35% × Retrieval + 50% × Inference + 15% × Forensics

All component scores are normalized to 0-100.
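
As a quick sanity check on the weighting, here is a minimal sketch; the function name is illustrative, and the actual scoring code lives in epsteinbench/evaluation.

def epsteinbench_score(retrieval: float, inference: float, forensics: float) -> float:
    """Weighted combination of per-module scores, each already on a 0-100 scale."""
    return 0.35 * retrieval + 0.50 * inference + 0.15 * forensics

# Example: strong retrieval, middling inference, weak forensics
print(round(epsteinbench_score(80.0, 55.0, 30.0), 2))  # 60.0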

LLM-as-a-Judge Evaluation

Beyond exact-match metrics, EpsteinBench uses an LLM judge to provide semantic evaluation:

Module      Judge Criteria                                              Scale
Retrieval   Correctness, Completeness, Source Quality, Hallucination    0-5 each
Inference   Correctness, Reasoning Quality, Calibration                 0-5 each
Forensics   Detection Accuracy, Reasoning Quality, Calibration          0-5 each

Judge scores are normalized to 0-100 and combined with automatic metrics for a comprehensive evaluation.
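
A minimal sketch of that normalization, assuming a simple mean over the per-criterion 0-5 scores; the actual judge aggregation is implemented in epsteinbench/evaluation and may weight criteria differently.

def normalize_judge_scores(criteria_scores: dict[str, float]) -> float:
    """Map per-criterion 0-5 judge scores to a single 0-100 value (simple mean)."""
    mean_score = sum(criteria_scores.values()) / len(criteria_scores)
    return mean_score * 20.0  # 0-5 scale -> 0-100 scale

print(normalize_judge_scores({
    "correctness": 4, "completeness": 3, "source_quality": 5, "hallucination": 4,
}))  # 80.0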

Supported Models

Default benchmark models via OpenRouter:

  • minimax/minimax-m2.1
  • zhipu-ai/glm-4-plus
  • x-ai/grok-2-1212
  • google/gemini-2.0-flash-exp

Additional models available:

  • anthropic/claude-sonnet-4
  • anthropic/claude-opus-4
  • openai/gpt-4o
  • openai/gpt-4o-mini
  • meta-llama/llama-3.3-70b-instruct
  • mistralai/mistral-large
  • deepseek/deepseek-chat
  • google/gemini-pro

Judge model (default): minimax/minimax-m2.1
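
The repository wraps OpenRouter in epsteinbench/models; for reference, a minimal direct call against OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like this, using a model slug from the list above. This is only an illustration of the upstream API, not the project's client.

import os
import requests

# Minimal OpenRouter request sketch; EpsteinBench ships its own client
# under epsteinbench/models.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "messages": [{"role": "user", "content": "Summarize this deposition excerpt: ..."}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])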

Project Structure

EpsteinBench/
├── epsteinbench/           # Main Python package
│   ├── api/                # FastAPI backend
│   ├── benchmark/          # Task definitions and generators
│   │   ├── retrieval/
│   │   ├── inference/
│   │   └── forensics/
│   ├── data/               # Data loaders and indexers
│   ├── evaluation/         # Metrics, scoring, and LLM-as-a-Judge
│   └── models/             # OpenRouter client
├── dashboard/              # React frontend (shadcn/ui)
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Page components
│   │   └── hooks/          # React hooks
├── scripts/                # CLI scripts
├── benchmarks/             # Generated task files
├── results/                # Evaluation outputs
└── docs/                   # Documentation

Configuration

Environment variables:

OPENROUTER_API_KEY=your_key_here  # Required for model evaluation

Configuration can be set in a .env file or in epsteinbench/config.py.
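
A minimal pattern for loading the key in your own scripts, assuming python-dotenv is installed; the project's own loading logic lives in epsteinbench/config.py.

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current directory, if present
api_key = os.environ.get("OPENROUTER_API_KEY")
if not api_key:
    raise RuntimeError("OPENROUTER_API_KEY is not set; model evaluation will fail.")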

API Endpoints

Endpoint                            Description
GET  /api/health                    Health check
GET  /api/stats                     Benchmark statistics
GET  /api/leaderboard               Model rankings
GET  /api/models                    List evaluated models
GET  /api/models/{name}             Detailed model scores
GET  /api/tasks                     Browse benchmark tasks
GET  /api/compare                   Compare multiple models
POST /api/control/benchmark/run     Start evaluation job (supports run_judge, judge_model options)

Full API documentation is available at /docs when the server is running.
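
For scripted access, the endpoints can be called directly against a locally running server. The sketch below assumes the default port from the Quick Start; the request body for the run endpoint is an assumption based on the options listed above, so check /docs for the exact schema.

import requests

BASE = "http://localhost:8000"

# Read the current leaderboard
leaderboard = requests.get(f"{BASE}/api/leaderboard", timeout=30).json()

# Kick off an evaluation run; the body fields are assumptions based on the
# options listed in the table above -- consult /docs for the real schema.
job = requests.post(
    f"{BASE}/api/control/benchmark/run",
    json={
        "model": "anthropic/claude-sonnet-4",
        "run_judge": True,
        "judge_model": "minimax/minimax-m2.1",
    },
    timeout=30,
).json()
print(job)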

Dashboard

The dashboard features a "Forensic Noir" design theme:

  • Leaderboard: Real-time model rankings with score breakdowns
  • Control Hub: Manage data, tasks, models, and run benchmarks
  • Compare: Side-by-side model comparison with charts
  • Tasks: Browse benchmark tasks and model predictions
  • About: Methodology and ethical guidelines

Built with React, TypeScript, Tailwind CSS, and shadcn/ui components.

Ethical Considerations

This benchmark is built on publicly released court documents. We have taken care to:

  • Exclude CSAM: No content related to child sexual abuse material
  • Protect Victims: No identifying information about minor victims
  • Research Focus: Designed for academic evaluation of AI capabilities

See docs/ETHICS.md for detailed guidelines.

Citation

@misc{epsteinbench2025,
  title={EpsteinBench: A Benchmark for AI Document Analysis Under Redaction},
  year={2025},
  url={https://github.com/CMLKevin/EpsteinBench}
}

License

MIT License - see LICENSE for details.

Acknowledgments

  • Data sources: tensonaut/EPSTEIN_FILES_20K, phelix001/epstein-network
  • Built with: FastAPI, React, OpenRouter, HuggingFace Datasets, shadcn/ui
