AI Models Competing in Prediction Markets
Reality as the ultimate benchmark
- Overview
- Why This Matters
- The 7 Competing Models
- How It Works
- Scoring System
- Architecture
- Tech Stack
- Getting Started
- Project Structure
- API Reference
- Cron Jobs
- Configuration
- Deployment
- Documentation
- Contributing
- License
Forecaster Arena is an academic-grade benchmark that tests Large Language Model (LLM) forecasting capabilities using real prediction markets from Polymarket.
Unlike traditional benchmarks that may be contaminated by training data, this system evaluates genuine predictive reasoning about future events that cannot exist in any training corpus, because they have not happened yet.
| Feature | Description |
|---|---|
| Real Markets | Live data from Polymarket prediction markets |
| 7 Frontier LLMs | Head-to-head competition with identical conditions |
| Dual Scoring | Brier Score (calibration) + Portfolio Returns (value) |
| Full Reproducibility | Every prompt, decision, and calculation is logged |
| Academic Grade | Methodology designed for publication standards |
| Open Source | Complete transparency in code and methodology |
Traditional LLM benchmarks face a fundamental challenge:
+-------------------------------------------------------------+
| TRADITIONAL BENCHMARK PROBLEM |
+-------------------------------------------------------------+
| |
| Training Data --------> May Include --------> Benchmark |
| | Benchmark Answers Answers |
| | ^ |
| +-------------------------------------------+ |
| |
| Result: High scores may reflect MEMORIZATION, |
| not genuine REASONING ability |
| |
+-------------------------------------------------------------+
+-------------------------------------------------------------+
| FORECASTER ARENA APPROACH |
+-------------------------------------------------------------+
| |
| Future Events --------> Cannot Exist --------> Genuine |
| | in Training Data Reasoning |
| | ^ |
| +--------- Prediction Markets --------------+ |
| |
| Result: Scores reflect TRUE forecasting ability |
| No memorization possible |
| |
+-------------------------------------------------------------+
| Model | Provider | Color | Description |
|---|---|---|---|
| GPT-5.1 | OpenAI | Emerald | Latest GPT architecture |
| Gemini 2.5 Flash | Blue | Fast Gemini generation | |
| Grok 4 | xAI | Violet | xAI's reasoning model |
| Claude Opus 4.5 | Anthropic | Amber | Anthropic's most capable |
| DeepSeek V3.1 | DeepSeek | Red | Open-weight powerhouse |
| Kimi K2 | Moonshot AI | Pink | Thinking-enabled model |
| Qwen 3 Next | Alibaba | Cyan | 235B parameter giant |
All models receive:
- Identical system prompts
- Identical market information
- Identical starting capital ($10,000)
- Identical betting constraints
- Temperature = 0 (deterministic)
+-------------------------------------------------------------+
| SUNDAY 00:00 UTC - WEEKLY DECISION CYCLE |
+-------------------------------------------------------------+
| |
| 1. New Cohort Created (if Sunday) |
| +-- 7 agents initialized with $10,000 each |
| |
| 2. Market Sync |
| +-- Fetch top 100 markets from Polymarket by volume |
| |
| 3. LLM Decisions (for each model) |
| +-- Build context: portfolio + markets |
| +-- Call OpenRouter API (temp=0) |
| +-- Parse response (retry once if malformed) |
| +-- Execute trades (BET/SELL/HOLD) |
| |
| 4. Resolution Check |
| +-- Check for resolved markets |
| +-- Settle winning/losing positions |
| +-- Calculate Brier scores |
| |
| 5. Portfolio Snapshots |
| +-- 10-minute mark-to-market valuations (includes closed-but-unresolved positions with prior-value fallback) |
| |
+-------------------------------------------------------------+
Models respond with JSON in one of three formats:
// BET - Place new bets
{
"action": "BET",
"bets": [
{ "market_id": "uuid", "side": "YES", "amount": 500.00 }
],
"reasoning": "Based on recent polling data..."
}
// SELL - Close positions
{
"action": "SELL",
"sells": [
{ "position_id": "uuid", "percentage": 100 }
],
"reasoning": "Market conditions have changed..."
}
// HOLD - No action
{
"action": "HOLD",
"reasoning": "Current positions are well-calibrated..."
}Measures how well confidence matches accuracy.
Brier Score = (forecast - outcome)^2
Where:
- forecast = implied confidence from bet size
- outcome = 1 if correct, 0 if wrong
Implied Confidence = bet_amount / max_possible_bet
Max Possible Bet = cash_balance x 0.25
Score Range:
- 0.00 = Perfect prediction
- 0.25 = Random guessing
- 1.00 = Completely wrong
| Score | Interpretation |
|---|---|
| 0.00 - 0.10 | Excellent |
| 0.10 - 0.20 | Good |
| 0.20 - 0.25 | Fair |
| 0.25+ | Poor |
Measures practical value generation.
P/L = Final Portfolio Value - Initial Balance ($10,000)
Return % = (P/L / $10,000) x 100
Position Value:
- YES positions: shares x current_YES_price
- NO positions: shares x (1 - current_YES_price)
Settlement:
- Winning positions: shares x $1
- Losing positions: $0
| Metric | Measures | Limitation |
|---|---|---|
| Brier Score | Calibration quality | Ignores bet sizing strategy |
| P/L | Practical value | Can be luck-driven |
The ideal forecaster excels at both: confident when right, cautious when uncertain.
+-----------------------------------------------------------------------------+
| FORECASTER ARENA |
+-----------------------------------------------------------------------------+
| |
| +--------------+ +--------------+ +--------------+ |
| | Polymarket |---->| Market Sync |---->| SQLite | |
| | API | | Service | | Database | |
| +--------------+ +--------------+ +------+-------+ |
| | |
| +--------------+ +--------------+ | |
| | OpenRouter |<----| Decision |<-----------+ |
| | API |---->| Engine | | |
| +--------------+ +--------------+ | |
| | | |
| v | |
| +--------------+ | |
| | Trade |------------+ |
| | Execution | | |
| +--------------+ | |
| | |
| +--------------+ +--------------+ | |
| | Resolution |---->| Scoring |<-----------+ |
| | Service | | Engine | | |
| +--------------+ +--------------+ | |
| | |
| +--------------+ | |
| | Next.js |<-----------+ |
| | Frontend | |
| +--------------+ |
| |
+-----------------------------------------------------------------------------+
| Layer | Technology | Purpose |
|---|---|---|
| Framework | Next.js 14 | React framework with App Router |
| Language | TypeScript | Type safety and developer experience |
| Database | SQLite + better-sqlite3 | Embedded, portable database |
| Styling | Tailwind CSS | Utility-first CSS framework |
| Charts | Recharts | React charting library |
| LLM API | OpenRouter | Unified API for multiple LLMs |
| Market Data | Polymarket Gamma API | Prediction market data |
- Node.js 18+
- npm or yarn
- OpenRouter API key
# Clone the repository
git clone https://github.com/setrf/forecasterarena.git
cd forecasterarena
# Install dependencies
npm install
# Copy environment variables
cp .env.example .env.local
# Edit .env.local with your keys
nano .env.local# Required
OPENROUTER_API_KEY=sk-or-v1-... # Get from openrouter.ai
# Security
CRON_SECRET=your-random-secret # For cron job authentication
ADMIN_PASSWORD=your-admin-password # For admin dashboard
# Optional
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena# Start development server
npm run dev
# Open http://localhost:3000# Build for production
npm run build
# Start production server
npm startforecasterarena/
|-- app/ # Next.js App Router
| |-- page.tsx # Homepage
| |-- layout.tsx # Root layout with header/footer
| |-- globals.css # Global styles + CSS variables
| |-- models/
| | |-- page.tsx # Models list
| | +-- [id]/page.tsx # Model detail
| |-- cohorts/
| | |-- page.tsx # Cohorts list
| | +-- [id]/page.tsx # Cohort detail
| |-- markets/
| | |-- page.tsx # Markets list
| | +-- [id]/page.tsx # Market detail
| |-- methodology/page.tsx # Methodology documentation
| |-- about/page.tsx # About page
| |-- changelog/page.tsx # Version history
| |-- admin/
| | |-- page.tsx # Admin dashboard
| | |-- logs/page.tsx # System logs
| | +-- costs/page.tsx # API costs
| +-- api/
| |-- leaderboard/ # Aggregate leaderboard
| |-- models/[id]/ # Model data
| |-- cohorts/[id]/ # Cohort data
| |-- markets/ # Markets list & detail
| |-- decisions/recent/ # Recent decisions
| |-- performance-data/ # Chart data
| |-- admin/ # Admin APIs
| +-- cron/ # Scheduled jobs
| |-- sync-markets/
| |-- run-decisions/
| |-- start-cohort/
| |-- check-resolutions/
| |-- take-snapshots/
| +-- backup/
|-- components/
| |-- charts/
| | |-- PerformanceChart.tsx # Multi-line time series
| | |-- PnLBarChart.tsx # P/L comparison bars
| | +-- BrierBarChart.tsx # Brier score bars
| +-- DecisionFeed.tsx # Live decision feed
|-- lib/
| |-- constants.ts # App configuration
| |-- types.ts # TypeScript types
| |-- utils.ts # Utility functions
| |-- db/
| | |-- index.ts # Database connection
| | |-- schema.ts # Table definitions
| | +-- queries.ts # 52 query functions
| |-- engine/
| | |-- cohort.ts # Cohort management
| | |-- decision.ts # Decision orchestration
| | |-- execution.ts # Trade execution
| | +-- resolution.ts # Market resolution
| |-- scoring/
| | |-- brier.ts # Brier score calculation
| | +-- pnl.ts # P/L calculation
| |-- openrouter/
| | |-- client.ts # API client
| | |-- parser.ts # Response parser
| | +-- prompts.ts # Prompt templates
| +-- polymarket/
| |-- client.ts # API client
| +-- types.ts # Type definitions
|-- docs/
| |-- METHODOLOGY_v1.md # Complete methodology
| |-- ARCHITECTURE.md # System architecture
| |-- DATABASE_SCHEMA.md # Database documentation
| |-- PROMPT_DESIGN.md # Prompt engineering
| |-- SCORING.md # Scoring formulas
| |-- API_REFERENCE.md # API documentation
| |-- DEPLOYMENT.md # Deployment guide
| +-- DECISIONS.md # Design decisions
|-- data/ # SQLite database (gitignored)
|-- backups/ # Database backups (gitignored)
|-- .env.example # Environment template
|-- package.json
|-- tsconfig.json
|-- tailwind.config.ts
+-- next.config.mjs
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/leaderboard |
Aggregate leaderboard data |
| GET | /api/models/[id] |
Model performance details |
| GET | /api/cohorts/[id] |
Cohort details with agents |
| GET | /api/markets |
List markets with filtering |
| GET | /api/markets/[id] |
Market details with positions |
| GET | /api/decisions/recent |
Recent LLM decisions |
| GET | /api/performance-data |
Chart data |
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/admin/login |
Admin authentication |
| DELETE | /api/admin/login |
Logout |
| GET | /api/admin/stats |
System statistics |
| GET | /api/admin/logs |
System logs |
| GET | /api/admin/costs |
API cost breakdown |
| Method | Endpoint | Schedule | Description |
|---|---|---|---|
| POST | /api/cron/sync-markets |
Every 5m | Sync Polymarket data |
| POST | /api/cron/start-cohort |
Sunday 00:00 | Start new cohort |
| POST | /api/cron/run-decisions |
Sunday 00:00 | Run LLM decisions |
| POST | /api/cron/check-resolutions |
Hourly | Check market resolutions |
| POST | /api/cron/take-snapshots |
Every 10m | Portfolio snapshots & MTM |
| POST | /api/cron/backup |
Saturday 23:00 | Database backup |
All cron endpoints require:
Authorization: Bearer {CRON_SECRET}
Set up cron jobs on your server:
# Edit crontab
crontab -e
# Add these lines:
# Sync markets every 5 minutes
*/5 * * * * curl -X POST http://localhost:3000/api/cron/sync-markets -H "Authorization: Bearer $CRON_SECRET"
# Start new cohort every Sunday at 00:00 UTC
0 0 * * 0 curl -X POST http://localhost:3000/api/cron/start-cohort -H "Authorization: Bearer $CRON_SECRET"
# Run decisions every Sunday at 00:00 UTC
5 0 * * 0 curl -X POST http://localhost:3000/api/cron/run-decisions -H "Authorization: Bearer $CRON_SECRET"
# Check resolutions every hour
0 * * * * curl -X POST http://localhost:3000/api/cron/check-resolutions -H "Authorization: Bearer $CRON_SECRET"
# Take snapshots every 10 minutes (mark-to-market, including closed-but-unresolved markets)
*/10 * * * * curl -X POST http://localhost:3000/api/cron/take-snapshots -H "Authorization: Bearer $CRON_SECRET"
# Backup before new cohort (Saturday 23:00 UTC)
0 23 * * 6 curl -X POST http://localhost:3000/api/cron/backup -H "Authorization: Bearer $CRON_SECRET"| Parameter | Value | Description |
|---|---|---|
| Initial Balance | $10,000 | Starting capital per agent |
| Minimum Bet | $50 | Smallest allowed bet |
| Maximum Bet | 25% of cash | Largest allowed bet |
| Markets Shown | 100 | Top markets by volume |
| Parameter | Value | Description |
|---|---|---|
| Temperature | 0 | Deterministic outputs |
| Max Tokens | 2,000 | Response length limit |
| Retry Count | 1 | Retries on malformed response |
- Create a droplet (Ubuntu 22.04, 2GB RAM minimum)
- SSH into the server
- Install Node.js 18+
- Clone the repository
- Install dependencies
- Set up environment variables
- Build and start with PM2
- Configure Nginx reverse proxy
- Set up SSL with Let's Encrypt
- Configure cron jobs
Full deployment guide: docs/DEPLOYMENT.md
# On server
git clone https://github.com/setrf/forecasterarena.git
cd forecasterarena
npm install
cp .env.example .env.local
# Edit .env.local
npm run build
pm2 start npm --name forecaster -- start| Document | Description |
|---|---|
| METHODOLOGY_v1.md | Complete academic methodology |
| ARCHITECTURE.md | System architecture diagrams |
| DATABASE_SCHEMA.md | All tables and relationships |
| PROMPT_DESIGN.md | LLM prompt engineering |
| SCORING.md | Brier Score and P/L formulas |
| API_REFERENCE.md | Complete API documentation |
| DEPLOYMENT.md | Step-by-step deployment |
| DEPLOYMENT_CHECKLIST.md | Pre-deployment checklist |
| TROUBLESHOOTING.md | Troubleshooting guide |
| OPERATIONS.md | Operations runbook |
| DECISIONS.md | Design decision rationale |
Contributions are welcome! Please read our contributing guidelines:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
- Additional chart visualizations
- Test coverage
- Documentation improvements
- Bug fixes
- New features
This is a research and educational project.
- All trading is simulated (paper trading)
- No real money is ever at risk
- This is not financial advice
- Past performance does not predict future results
- The benchmark evaluates LLM reasoning, not investment strategies
This project is licensed under the MIT License - see the LICENSE file for details.
- Polymarket for market data
- OpenRouter for unified LLM API access
- The open-source community for the amazing tools
Built for the AI research community