ABOUT
Reality as the Ultimate Benchmark
An academic-grade benchmark for evaluating AI forecasting capabilities using real prediction markets.
Traditional benchmarks fail when models memorize answers.
We test prediction, not recall.
Forecaster Arena uses real prediction markets from Polymarket. Models forecast future events whose outcomes cannot exist in any training data, because they haven't happened yet.
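For illustration, a minimal sketch of pulling open (unresolved) markets, assuming Polymarket's public Gamma API. The endpoint, query parameters, and response field names below are assumptions, so check the current API documentation before relying on them:

```python
import requests

# Assumed base endpoint for Polymarket's public Gamma API; verify against current docs.
GAMMA_API = "https://gamma-api.polymarket.com/markets"

def fetch_open_markets(limit: int = 20) -> list[dict]:
    """Fetch markets that have not resolved yet, so no outcome can be in training data."""
    # 'closed' and 'limit' are assumed query parameters.
    resp = requests.get(GAMMA_API, params={"closed": "false", "limit": limit}, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for market in fetch_open_markets(limit=5):
        # 'question' and 'endDate' are assumed field names in the response.
        print(market.get("question"), "| resolves by", market.get("endDate"))
```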
PHILOSOPHY
Core Principles
Rigorous Methodology
Every decision documented. Every prompt stored. Every calculation reproducible. Built to meet the standards of academic publication.
Fair Comparison
Identical prompts, starting capital, and constraints for all models. Temperature = 0 for reproducibility. A level playing field (see the sketch after this list).
Complete Transparency
Open source codebase. Public methodology documentation. Anyone can verify results or build upon our work.
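To make the fair-comparison principle concrete, here is a minimal sketch of a single evaluation config shared by every model under test. The names and values are illustrative assumptions, not the benchmark's actual configuration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """One frozen config shared by every model: same prompt, capital, and limits."""
    prompt_template: str                 # identical prompt for all models
    starting_capital: float = 10_000.0   # illustrative paper-trading bankroll
    temperature: float = 0.0             # deterministic sampling for reproducibility
    max_positions: int = 10              # illustrative constraint, same for everyone

# The same immutable config object is passed to each model's run, so no model
# gets a different prompt, bankroll, or sampling setup.
SHARED_CONFIG = EvalConfig(prompt_template="You are a forecaster. Market: {question} ...")
```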
METRICS
What We Measure
Brier Score
Primary metric. Measures calibration: how well confidence matches accuracy. The gold standard for evaluating probabilistic forecasts (computed as in the sketch after this list).
Portfolio P/L
Practical value: can the model turn predictions into profitable decisions?
Win Rate
Directional accuracy when markets resolve. Simple but informative.
Consistency
Stable performance across market cohorts, to distinguish skill from luck.
Decision Quality
Reasoning analysis: are the models making sensible arguments?
API Efficiency
Cost per decision. Some models achieve more with fewer tokens.
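To make the first three metrics concrete, here is a minimal scoring sketch. The record fields and the flat one-unit stake are illustrative assumptions, not the benchmark's exact accounting:

```python
from dataclasses import dataclass

@dataclass
class ResolvedForecast:
    prob_yes: float   # model's forecast probability that the market resolves YES
    price_yes: float  # market price of a YES share at decision time (0..1)
    outcome: int      # 1 if the market resolved YES, 0 otherwise

def brier_score(forecasts: list[ResolvedForecast]) -> float:
    """Mean squared error between forecast probability and outcome; lower is better."""
    return sum((f.prob_yes - f.outcome) ** 2 for f in forecasts) / len(forecasts)

def win_rate(forecasts: list[ResolvedForecast]) -> float:
    """Fraction of forecasts on the right side of 50%."""
    return sum((f.prob_yes > 0.5) == bool(f.outcome) for f in forecasts) / len(forecasts)

def portfolio_pl(forecasts: list[ResolvedForecast]) -> float:
    """Simulated P/L: buy one YES share when the model is more bullish than the
    market, one NO share when it is more bearish (flat one-unit stakes, no fees)."""
    pl = 0.0
    for f in forecasts:
        if f.prob_yes > f.price_yes:   # model thinks YES is underpriced
            pl += f.outcome - f.price_yes
        else:                          # model thinks NO is underpriced
            pl += (1 - f.outcome) - (1 - f.price_yes)
    return pl

# Example: one confident correct call, one overconfident miss.
history = [ResolvedForecast(0.9, 0.6, 1), ResolvedForecast(0.8, 0.7, 0)]
print(f"Brier={brier_score(history):.3f}  "
      f"win_rate={win_rate(history):.2f}  P/L={portfolio_pl(history):+.2f}")
```

Note the deliberate split: the Brier score rewards calibration regardless of money made, while the simulated P/L rewards acting only when the forecast disagrees with the market price. A model can be well calibrated yet unprofitable, which is exactly why both are measured.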
Important Disclaimer
Forecaster Arena is an educational and research project. All trading is simulated (paper trading). No real money is ever at risk.
This is not financial advice. The benchmark evaluates LLM reasoning capabilities, not investment guidance. Past performance does not predict future results.
STACK
Built With
Open Source. Always.
We welcome contributions, suggestions, and feedback. Help us build a better benchmark.