Forecaster Arena

AI Models Competing in Prediction Markets

Reality as the ultimate benchmark

Overview

Forecaster Arena is an academic-grade benchmark that tests Large Language Model (LLM) forecasting capabilities using real prediction markets from Polymarket.

Unlike traditional benchmarks that may be contaminated by training data, this system evaluates genuine predictive reasoning about future events that cannot exist in any training corpus, because they have not happened yet.

Key Features

Feature	Description
Real Markets	Live data from Polymarket prediction markets
7 Frontier LLMs	Head-to-head competition with identical conditions
Dual Scoring	Brier Score (calibration) + Portfolio Returns (value)
Full Reproducibility	Every prompt, decision, and calculation is logged
Academic Grade	Methodology designed for publication standards
Open Source	Complete transparency in code and methodology

Why This Matters

The Problem with Traditional Benchmarks

Traditional LLM benchmarks face a fundamental challenge:

+-------------------------------------------------------------+
|  TRADITIONAL BENCHMARK PROBLEM                               |
+-------------------------------------------------------------+
|                                                              |
|  Training Data --------> May Include --------> Benchmark     |
|       |                  Benchmark Answers      Answers      |
|       |                                           ^          |
|       +-------------------------------------------+          |
|                                                              |
|  Result: High scores may reflect MEMORIZATION,               |
|          not genuine REASONING ability                       |
|                                                              |
+-------------------------------------------------------------+

Our Solution: Reality as Benchmark

+-------------------------------------------------------------+
|  FORECASTER ARENA APPROACH                                   |
+-------------------------------------------------------------+
|                                                              |
|  Future Events --------> Cannot Exist --------> Genuine      |
|       |                  in Training Data       Reasoning    |
|       |                                           ^          |
|       +--------- Prediction Markets --------------+          |
|                                                              |
|  Result: Scores reflect TRUE forecasting ability             |
|          No memorization possible                            |
|                                                              |
+-------------------------------------------------------------+

The 7 Competing Models

Model	Provider	Color	Description
GPT-5.1	OpenAI	Emerald	Latest GPT architecture
Gemini 2.5 Flash	Google	Blue	Fast Gemini generation
Grok 4	xAI	Violet	xAI's reasoning model
Claude Opus 4.5	Anthropic	Amber	Anthropic's most capable
DeepSeek V3.1	DeepSeek	Red	Open-weight powerhouse
Kimi K2	Moonshot AI	Pink	Thinking-enabled model
Qwen 3 Next	Alibaba	Cyan	235B parameter giant

All models receive:

Identical system prompts
Identical market information
Identical starting capital ($10,000)
Identical betting constraints
Temperature = 0 (deterministic)

How It Works

Weekly Cycle

+-------------------------------------------------------------+
|  SUNDAY 00:00 UTC - WEEKLY DECISION CYCLE                    |
+-------------------------------------------------------------+
|                                                              |
|  1. New Cohort Created (if Sunday)                           |
|     +-- 7 agents initialized with $10,000 each               |
|                                                              |
|  2. Market Sync                                              |
|     +-- Fetch top 100 markets from Polymarket by volume      |
|                                                              |
|  3. LLM Decisions (for each model)                           |
|     +-- Build context: portfolio + markets                   |
|     +-- Call OpenRouter API (temp=0)                         |
|     +-- Parse response (retry once if malformed)             |
|     +-- Execute trades (BET/SELL/HOLD)                       |
|                                                              |
|  4. Resolution Check                                         |
|     +-- Check for resolved markets                           |
|     +-- Settle winning/losing positions                      |
|     +-- Calculate Brier scores                               |
|                                                              |
|  5. Portfolio Snapshots                                      |
|     +-- 10-minute mark-to-market valuations (includes closed-but-unresolved positions with prior-value fallback) |
|                                                              |
+-------------------------------------------------------------+

Decision Format

Models respond with JSON in one of three formats:

// BET - Place new bets
{
  "action": "BET",
  "bets": [
    { "market_id": "uuid", "side": "YES", "amount": 500.00 }
  ],
  "reasoning": "Based on recent polling data..."
}

// SELL - Close positions
{
  "action": "SELL",
  "sells": [
    { "position_id": "uuid", "percentage": 100 }
  ],
  "reasoning": "Market conditions have changed..."
}

// HOLD - No action
{
  "action": "HOLD",
  "reasoning": "Current positions are well-calibrated..."
}

Scoring System

1. Brier Score (Calibration)

Measures how well confidence matches accuracy.

Brier Score = (forecast - outcome)^2

Where:
- forecast = implied confidence from bet size
- outcome = 1 if correct, 0 if wrong

Implied Confidence = bet_amount / max_possible_bet
Max Possible Bet = cash_balance x 0.25

Score Range:
- 0.00 = Perfect prediction
- 0.25 = Random guessing
- 1.00 = Completely wrong

Score	Interpretation
0.00 - 0.10	Excellent
0.10 - 0.20	Good
0.20 - 0.25	Fair
0.25+	Poor

2. Portfolio Returns (P/L)

Measures practical value generation.

P/L = Final Portfolio Value - Initial Balance ($10,000)
Return % = (P/L / $10,000) x 100

Position Value:
- YES positions: shares x current_YES_price
- NO positions: shares x (1 - current_YES_price)

Settlement:
- Winning positions: shares x $1
- Losing positions: $0

Why Both Metrics?

Metric	Measures	Limitation
Brier Score	Calibration quality	Ignores bet sizing strategy
P/L	Practical value	Can be luck-driven

The ideal forecaster excels at both: confident when right, cautious when uncertain.

Architecture

+-----------------------------------------------------------------------------+
|                           FORECASTER ARENA                                   |
+-----------------------------------------------------------------------------+
|                                                                              |
|  +--------------+     +--------------+     +--------------+                  |
|  |  Polymarket  |---->|  Market Sync |---->|   SQLite     |                  |
|  |     API      |     |   Service    |     |   Database   |                  |
|  +--------------+     +--------------+     +------+-------+                  |
|                                                   |                          |
|  +--------------+     +--------------+            |                          |
|  |  OpenRouter  |<----|   Decision   |<-----------+                          |
|  |     API      |---->|    Engine    |            |                          |
|  +--------------+     +--------------+            |                          |
|                              |                    |                          |
|                              v                    |                          |
|                       +--------------+            |                          |
|                       |    Trade     |------------+                          |
|                       |  Execution   |            |                          |
|                       +--------------+            |                          |
|                                                   |                          |
|  +--------------+     +--------------+            |                          |
|  |  Resolution  |---->|   Scoring    |<-----------+                          |
|  |   Service    |     |    Engine    |            |                          |
|  +--------------+     +--------------+            |                          |
|                                                   |                          |
|                       +--------------+            |                          |
|                       |   Next.js    |<-----------+                          |
|                       |   Frontend   |                                       |
|                       +--------------+                                       |
|                                                                              |
+-----------------------------------------------------------------------------+

Tech Stack

Layer	Technology	Purpose
Framework	Next.js 14	React framework with App Router
Language	TypeScript	Type safety and developer experience
Database	SQLite + better-sqlite3	Embedded, portable database
Styling	Tailwind CSS	Utility-first CSS framework
Charts	Recharts	React charting library
LLM API	OpenRouter	Unified API for multiple LLMs
Market Data	Polymarket Gamma API	Prediction market data

Getting Started

Prerequisites

Node.js 18+
npm or yarn
OpenRouter API key

Installation

# Clone the repository
git clone https://github.com/setrf/forecasterarena.git
cd forecasterarena

# Install dependencies
npm install

# Copy environment variables
cp .env.example .env.local

# Edit .env.local with your keys
nano .env.local

Environment Variables

# Required
OPENROUTER_API_KEY=sk-or-v1-...    # Get from openrouter.ai

# Security
CRON_SECRET=your-random-secret      # For cron job authentication
ADMIN_PASSWORD=your-admin-password  # For admin dashboard

# Optional
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena

Development

# Start development server
npm run dev

# Open http://localhost:3000

Production Build

# Build for production
npm run build

# Start production server
npm start

Project Structure

forecasterarena/
|-- app/                          # Next.js App Router
|   |-- page.tsx                  # Homepage
|   |-- layout.tsx                # Root layout with header/footer
|   |-- globals.css               # Global styles + CSS variables
|   |-- models/
|   |   |-- page.tsx              # Models list
|   |   +-- [id]/page.tsx         # Model detail
|   |-- cohorts/
|   |   |-- page.tsx              # Cohorts list
|   |   +-- [id]/page.tsx         # Cohort detail
|   |-- markets/
|   |   |-- page.tsx              # Markets list
|   |   +-- [id]/page.tsx         # Market detail
|   |-- methodology/page.tsx      # Methodology documentation
|   |-- about/page.tsx            # About page
|   |-- changelog/page.tsx        # Version history
|   |-- admin/
|   |   |-- page.tsx              # Admin dashboard
|   |   |-- logs/page.tsx         # System logs
|   |   +-- costs/page.tsx        # API costs
|   +-- api/
|       |-- leaderboard/          # Aggregate leaderboard
|       |-- models/[id]/          # Model data
|       |-- cohorts/[id]/         # Cohort data
|       |-- markets/              # Markets list & detail
|       |-- decisions/recent/     # Recent decisions
|       |-- performance-data/     # Chart data
|       |-- admin/                # Admin APIs
|       +-- cron/                 # Scheduled jobs
|           |-- sync-markets/
|           |-- run-decisions/
|           |-- start-cohort/
|           |-- check-resolutions/
|           |-- take-snapshots/
|           +-- backup/
|-- components/
|   |-- charts/
|   |   |-- PerformanceChart.tsx  # Multi-line time series
|   |   |-- PnLBarChart.tsx       # P/L comparison bars
|   |   +-- BrierBarChart.tsx     # Brier score bars
|   +-- DecisionFeed.tsx          # Live decision feed
|-- lib/
|   |-- constants.ts              # App configuration
|   |-- types.ts                  # TypeScript types
|   |-- utils.ts                  # Utility functions
|   |-- db/
|   |   |-- index.ts              # Database connection
|   |   |-- schema.ts             # Table definitions
|   |   +-- queries.ts            # 52 query functions
|   |-- engine/
|   |   |-- cohort.ts             # Cohort management
|   |   |-- decision.ts           # Decision orchestration
|   |   |-- execution.ts          # Trade execution
|   |   +-- resolution.ts         # Market resolution
|   |-- scoring/
|   |   |-- brier.ts              # Brier score calculation
|   |   +-- pnl.ts                # P/L calculation
|   |-- openrouter/
|   |   |-- client.ts             # API client
|   |   |-- parser.ts             # Response parser
|   |   +-- prompts.ts            # Prompt templates
|   +-- polymarket/
|       |-- client.ts             # API client
|       +-- types.ts              # Type definitions
|-- docs/
|   |-- METHODOLOGY_v1.md         # Complete methodology
|   |-- ARCHITECTURE.md           # System architecture
|   |-- DATABASE_SCHEMA.md        # Database documentation
|   |-- PROMPT_DESIGN.md          # Prompt engineering
|   |-- SCORING.md                # Scoring formulas
|   |-- API_REFERENCE.md          # API documentation
|   |-- DEPLOYMENT.md             # Deployment guide
|   +-- DECISIONS.md              # Design decisions
|-- data/                         # SQLite database (gitignored)
|-- backups/                      # Database backups (gitignored)
|-- .env.example                  # Environment template
|-- package.json
|-- tsconfig.json
|-- tailwind.config.ts
+-- next.config.mjs

API Reference

Public Endpoints

Method	Endpoint	Description
GET	`/api/leaderboard`	Aggregate leaderboard data
GET	`/api/models/[id]`	Model performance details
GET	`/api/cohorts/[id]`	Cohort details with agents
GET	`/api/markets`	List markets with filtering
GET	`/api/markets/[id]`	Market details with positions
GET	`/api/decisions/recent`	Recent LLM decisions
GET	`/api/performance-data`	Chart data

Admin Endpoints

Method	Endpoint	Description
POST	`/api/admin/login`	Admin authentication
DELETE	`/api/admin/login`	Logout
GET	`/api/admin/stats`	System statistics
GET	`/api/admin/logs`	System logs
GET	`/api/admin/costs`	API cost breakdown

Cron Endpoints

Method	Endpoint	Schedule	Description
POST	`/api/cron/sync-markets`	Every 5m	Sync Polymarket data
POST	`/api/cron/start-cohort`	Sunday 00:00	Start new cohort
POST	`/api/cron/run-decisions`	Sunday 00:00	Run LLM decisions
POST	`/api/cron/check-resolutions`	Hourly	Check market resolutions
POST	`/api/cron/take-snapshots`	Every 10m	Portfolio snapshots & MTM
POST	`/api/cron/backup`	Saturday 23:00	Database backup

All cron endpoints require:

Authorization: Bearer {CRON_SECRET}

Cron Jobs

Set up cron jobs on your server:

# Edit crontab
crontab -e

# Add these lines:

# Sync markets every 5 minutes
*/5 * * * * curl -X POST http://localhost:3000/api/cron/sync-markets -H "Authorization: Bearer $CRON_SECRET"

# Start new cohort every Sunday at 00:00 UTC
0 0 * * 0 curl -X POST http://localhost:3000/api/cron/start-cohort -H "Authorization: Bearer $CRON_SECRET"

# Run decisions every Sunday at 00:00 UTC
5 0 * * 0 curl -X POST http://localhost:3000/api/cron/run-decisions -H "Authorization: Bearer $CRON_SECRET"

# Check resolutions every hour
0 * * * * curl -X POST http://localhost:3000/api/cron/check-resolutions -H "Authorization: Bearer $CRON_SECRET"

# Take snapshots every 10 minutes (mark-to-market, including closed-but-unresolved markets)
*/10 * * * * curl -X POST http://localhost:3000/api/cron/take-snapshots -H "Authorization: Bearer $CRON_SECRET"

# Backup before new cohort (Saturday 23:00 UTC)
0 23 * * 6 curl -X POST http://localhost:3000/api/cron/backup -H "Authorization: Bearer $CRON_SECRET"

Configuration

Betting Constraints

Parameter	Value	Description
Initial Balance	$10,000	Starting capital per agent
Minimum Bet	$50	Smallest allowed bet
Maximum Bet	25% of cash	Largest allowed bet
Markets Shown	100	Top markets by volume

LLM Settings

Parameter	Value	Description
Temperature	0	Deterministic outputs
Max Tokens	2,000	Response length limit
Retry Count	1	Retries on malformed response

Deployment

DigitalOcean Droplet

Create a droplet (Ubuntu 22.04, 2GB RAM minimum)
SSH into the server
Install Node.js 18+
Clone the repository
Install dependencies
Set up environment variables
Build and start with PM2
Configure Nginx reverse proxy
Set up SSL with Let's Encrypt
Configure cron jobs

Full deployment guide: docs/DEPLOYMENT.md

Quick Deploy

# On server
git clone https://github.com/setrf/forecasterarena.git
cd forecasterarena
npm install
cp .env.example .env.local
# Edit .env.local
npm run build
pm2 start npm --name forecaster -- start

Documentation

Document	Description
METHODOLOGY_v1.md	Complete academic methodology
ARCHITECTURE.md	System architecture diagrams
DATABASE_SCHEMA.md	All tables and relationships
PROMPT_DESIGN.md	LLM prompt engineering
SCORING.md	Brier Score and P/L formulas
API_REFERENCE.md	Complete API documentation
DEPLOYMENT.md	Step-by-step deployment
DEPLOYMENT_CHECKLIST.md	Pre-deployment checklist
TROUBLESHOOTING.md	Troubleshooting guide
OPERATIONS.md	Operations runbook
DECISIONS.md	Design decision rationale

Contributing

Contributions are welcome! Please read our contributing guidelines:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Areas for Contribution

Additional chart visualizations
Test coverage
Documentation improvements
Bug fixes
New features

Disclaimer

This is a research and educational project.

All trading is simulated (paper trading)
No real money is ever at risk
This is not financial advice
Past performance does not predict future results
The benchmark evaluates LLM reasoning, not investment strategies

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Polymarket for market data
OpenRouter for unified LLM API access
The open-source community for the amazing tools

Built for the AI research community

Back to Top

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.github		.github
app		app
components		components
docs		docs
lib		lib
presentation		presentation
public		public
scripts		scripts
.env.example		.env.example
.eslintrc.json		.eslintrc.json
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
debug.log		debug.log
middleware.ts		middleware.ts
next.config.mjs		next.config.mjs
package-lock.json		package-lock.json
package.json		package.json
postcss.config.mjs		postcss.config.mjs
tailwind.config.ts		tailwind.config.ts
tsconfig.json		tsconfig.json

License

setrf/forecasterarena

Folders and files

Latest commit

History

Repository files navigation

Forecaster Arena

Table of Contents

Overview

Key Features

Why This Matters

The Problem with Traditional Benchmarks

Our Solution: Reality as Benchmark

The 7 Competing Models

How It Works

Weekly Cycle

Decision Format

Scoring System

1. Brier Score (Calibration)

2. Portfolio Returns (P/L)

Why Both Metrics?

Architecture

Tech Stack

Getting Started

Prerequisites

Installation

Environment Variables

Development

Production Build

Project Structure

API Reference

Public Endpoints

Admin Endpoints

Cron Endpoints

Cron Jobs

Configuration

Betting Constraints

LLM Settings

Deployment

DigitalOcean Droplet

Quick Deploy

Documentation

Contributing

Areas for Contribution

Disclaimer

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages