ABOUT

Reality as the Ultimate Benchmark

An academic-grade benchmark for evaluating AI forecasting capabilities using real prediction markets.

Traditional benchmarks fail when models memorize answers.
We test prediction, not recall.

Forecaster Arena uses real prediction markets from Polymarket. Models forecast future events whose outcomes cannot appear in any training data, because they haven't happened yet.
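The selection idea above can be sketched in TypeScript. The `Market` shape and field names here are a simplified illustration, not Polymarket's actual schema: the point is only that every eligible market must resolve after the models' training cutoff.

```typescript
// Sketch: keep only markets that resolve strictly after a training cutoff,
// so outcomes cannot appear in any model's training data.
// This Market shape is illustrative, not Polymarket's real schema.
interface Market {
  question: string;
  closesAt: Date;   // when the market resolves
  active: boolean;  // still open for trading
}

function eligibleMarkets(markets: Market[], trainingCutoff: Date): Market[] {
  return markets.filter((m) => m.active && m.closesAt > trainingCutoff);
}
```

A real pipeline would populate `Market` from Polymarket's API; the filter itself is the part that guarantees no-memorization.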

PHILOSOPHY

Core Principles

01

Rigorous Methodology

Every decision documented. Every prompt stored. Every calculation reproducible. Built to meet the standards of academic publication.

02

Fair Comparison

Identical prompts, starting capital, and constraints for all models. Temperature = 0 for reproducibility. Level playing field.

03

Complete Transparency

Open source codebase. Public methodology documentation. Anyone can verify results or build upon our work.
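The fair-comparison principle above (identical prompts, identical constraints, temperature 0) can be sketched as a single request builder shared by all models. The prompt text and model names below are placeholders, not the benchmark's actual configuration; the payload shape follows OpenRouter's OpenAI-compatible chat-completions format.

```typescript
// Sketch: build the *same* request for every model, varying only the model id.
// Prompt and model names are placeholders, not the benchmark's real config.
interface ForecastRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  temperature: number; // 0 for reproducibility
}

function buildRequest(model: string, marketQuestion: string): ForecastRequest {
  return {
    model,
    messages: [
      { role: "system", content: "You are a forecaster. Reply with a probability." },
      { role: "user", content: marketQuestion },
    ],
    temperature: 0, // identical, deterministic settings for all models
  };
}

// Every model sees the same question under the same constraints.
const models = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"];
const requests = models.map((m) =>
  buildRequest(m, "Will event X resolve YES by 2025-12-31?")
);
```

A real run would POST each payload to OpenRouter's chat-completions endpoint with an API key; only the `model` field differs between runs.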

METRICS

What We Measure

Portfolio P/L

Practical value: can the model turn predictions into profitable decisions?

Win Rate

Directional accuracy when markets resolve. Simple but informative.

Consistency

Performance across cohorts distinguishes skill from luck.

Decision Quality

Reasoning analysis: do the models make sensible arguments?

API Efficiency

Cost per decision. Some models achieve more with fewer tokens.
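The two headline metrics above can be made concrete with a short sketch. The `ResolvedTrade` shape is illustrative, not the benchmark's actual data model; it assumes simulated stake and payout amounts recorded when a market resolves.

```typescript
// Sketch: headline metrics over resolved paper trades.
// ResolvedTrade is an illustrative shape, not the benchmark's real schema.
interface ResolvedTrade {
  stake: number;    // simulated capital placed on the trade
  payout: number;   // simulated capital returned at resolution
  correct: boolean; // did the model pick the winning side?
}

// Portfolio P/L: total returned minus total staked.
function portfolioPL(trades: ResolvedTrade[]): number {
  return trades.reduce((pl, t) => pl + (t.payout - t.stake), 0);
}

// Win rate: fraction of resolved trades on the winning side.
function winRate(trades: ResolvedTrade[]): number {
  if (trades.length === 0) return 0;
  return trades.filter((t) => t.correct).length / trades.length;
}
```

Consistency would then compare these numbers across cohorts of markets rather than over a single pooled set.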

Important Disclaimer

Forecaster Arena is an educational and research project. All trading is simulated (paper trading). No real money is ever at risk.

This is not financial advice. The benchmark evaluates LLM reasoning capabilities, not investment guidance. Past performance does not predict future results.


CONTACT

Mert Gulsun

UC Berkeley

STACK

Built With

Next.js 14 (Framework)
TypeScript (Language)
SQLite (Database)
OpenRouter (LLM API)
Polymarket (Market Data)
Tailwind (Styling)
Recharts (Charts)

Open Source. Always.

We welcome contributions, suggestions, and feedback. Help us build a better benchmark.