🌐 Website | 📄 Paper | GitHub | 中文
A production-grounded evaluation framework for AI agents.
94 tasks from 7 companies across 6 O*NET occupational domains.
Overview of AlphaEval: bridging the gap between research benchmarks and production reality.
- From Requirements to Benchmarks — A systematic framework that transforms authentic production requirements into fully automated, reproducible evaluations. Any real-world business need can be rapidly operationalized into a rigorous benchmark.
- Production-Grounded Tasks — 94 tasks preserving real-world complexity: ambiguous specifications, multi-modal inputs (PDFs, Excel, code, images), implicit constraints, and domain-expert evaluation criteria.
- Multi-Paradigm Evaluation — Multiple evaluation paradigms (Reference Verification, Formal Logic, Rubric-based, Execution-based) composed per task (avg. 2.8 types/task), with Docker-sandboxed execution and LLM-as-Judge as a cross-cutting method.
- Agent System Evaluation — Evaluates complete agent products (Claude Code, Codex, GitHub Copilot, Cursor), not just models. Scaffold choice matters as much as model choice.
The requirement-to-benchmark construction framework: Partner Engagement → Requirement Elicitation → Task Formalization → Iterative Evaluation.
The best configuration (Claude Code + Opus 4.6) achieves only 64.41/100, revealing a substantial research-production gap.
| Agent Product | Model | Avg Score |
|---|---|---|
| Claude Code | Claude Opus 4.6 | 64.41 |
| Cursor | Claude Opus 4.6 | 61.85 |
| GitHub Copilot | Claude Opus 4.6 | 61.31 |
| GitHub Copilot | GPT-5.2 | 54.91 |
| Codex | Claude Opus 4.6 | 53.45 |
Key findings:
- Scaffold matters as much as model: The same Claude Opus 4.6 scores 64.41 via Claude Code but 53.45 via Codex — an 11-point spread
- Extreme domain variance: Technology Research (avg 62.0) vs Human Resources (avg 30.0)
- No single score captures readiness: Inter-domain rank correlations are often statistically insignificant
- Production-specific failure modes: Cascade dependency, subjective judgment collapse, information retrieval failures, cross-section logical inconsistency, constraint misinterpretation, and format compliance failures — all invisible to coding-centric benchmarks
Tasks are classified following the O*NET occupational taxonomy:
| Domain | Tasks | Description |
|---|---|---|
| Human Resources | 11 | Resume screening against job descriptions |
| Finance & Investment | 22 | Investment research, pitch coaching, financial data extraction |
| Procurement & Operations | 23 | BOM cost optimization, procurement data analysis |
| Software Engineering | 11 | Full-stack mini-program development |
| Healthcare & Life Sciences | 16 | Clinical trial eCRF management, healthcare policy analysis |
| Technology Research | 11 | AI industry deep research, technical analysis |
```shell
# Clone
git clone https://github.com/GAIR-NLP/AlphaEval.git
cd AlphaEval

# Configure
cp config/config.example.yaml config.yaml
# Edit config.yaml with your API keys

# Install
pip install openai pyyaml

# Run evaluation
./run_eval.sh claude-code <task_id>
```

We provide 6 ready-to-use evaluation templates. Copy one and customize:
```shell
cp -r tasks/.templates/llm_judge tasks/my-new-task
# Edit task.yaml, query.md, and .eval/rubric.json
```

| Template | When to Use | Evaluation Method |
|---|---|---|
| `code_exec` | Verifiable numeric/structured output | Extract answer → compare to expected value |
| `llm_judge` | Subjective quality assessment | LLM judges each rubric point (covered/not) |
| `exact_match` | Single correct answer | String or numeric matching |
| `f1_match` | Select items from a set | Precision / Recall / F1 against ground truth |
| `hybrid` | Numeric accuracy + qualitative quality | Numerical verification + LLM-as-Judge |
| `ui_testing` | Agent builds a web/mobile app | Playwright headless browser + screenshots |
Taxonomy of evaluation methodologies. AlphaEval covers multiple paradigms, composing them per task.
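As an illustration of the `f1_match` template's scoring idea, the sketch below computes precision, recall, and F1 between the agent's selected items and a ground-truth set. The function name and return schema are illustrative assumptions, not AlphaEval's actual API.

```python
# Hypothetical f1_match-style scorer: compare the agent's selected items
# against a ground-truth set. Illustrative only; not the repo's real code.

def f1_match(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Score a set-selection task via precision, recall, and F1."""
    if not predicted or not ground_truth:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    tp = len(predicted & ground_truth)          # true positives
    precision = tp / len(predicted)             # fraction of picks that are correct
    recall = tp / len(ground_truth)             # fraction of truth that was found
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = f1_match({"a", "b", "c"}, {"b", "c", "d"})
# precision = 2/3, recall = 2/3, f1 = 2/3
```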
See the Task Creation Guide for step-by-step instructions and `examples/` for fictional demonstration tasks.
```
tasks/<task-name>/
├── task.yaml             # Metadata: name, category, difficulty, evaluation config
├── query.md              # Task prompt given to the agent
├── files/                # Input files (PDFs, Excel, images, code, etc.)
└── .eval/
    ├── rubric.py         # Evaluation script
    ├── rubric.json       # Rubric criteria (for llm_judge / hybrid)
    └── ground_truth.json # Ground truth (for f1_match / code_exec)
```
| Agent | Type | Description |
|---|---|---|
| Claude Code | CLI | Anthropic's agentic coding tool |
| Codex | CLI | OpenAI's coding agent |
| GitHub Copilot | CLI | GitHub's coding agent |
| Cursor | CLI | Cursor's AI coding agent |
All agents are invoked via CLI within Docker-sandboxed environments with full output trajectory recording.
📄 AlphaEval: Evaluating Agents in Production (v0)
This is an early version of the paper. It will be further revised and improved. An arXiv version will be released soon.
```bibtex
@article{alphaeval2026,
  title={AlphaEval: Evaluating Agents in Production},
  author={Anonymous},
  year={2026}
}
```

We thank Keyu Li, Tianze Xu, and Zhen Huang for their valuable contributions to this project.
For questions or collaboration inquiries, please contact: lupengrui@sjtu.edu.cn
MIT License — see LICENSE for details.