Above: a 1-hour test run on dual RTX 4090s — 17 experiments, 3 kept improvements, 1.48% val_bpb improvement. Both GPUs running experiments in parallel, batch_size=4, depth=8, 50.3M params. A longer overnight run (8-24h) across more GPUs would find significantly more improvements, same as karpathy's 2-day run that found 20+ improvements across 276 experiments.
Same idea as autoresearch — agent modifies train.py, trains for 5 minutes, keeps or discards, repeats — but with structured experiment state, multi-GPU parallelism, adaptive search, and crash recovery via iii (Worker/Function/Trigger). The agent is still external. Claude, Codex, whatever you want. This repo is the infrastructure that replaces the bash loop, git-as-state, and flat TSV with queryable experiment tracking across N GPUs. More GPUs = more parallel experiments = faster research.
The repo has three files that matter, same as autoresearch:
- `prepare.py` — data prep, tokenizer, eval. Not modified.
- `train.py` — model, optimizer, training loop. The agent edits this.
- `program.md` — agent instructions. The human edits this.
Two workers talk to iii to provide the infrastructure:
- Orchestrator (Python) — 22 functions for experiment tracking, search strategy, GPU pool, reporting.
- GPU Worker (Rust) — one per GPU, executes `uv run train.py`, parses metrics, handles timeouts.
The agent calls the same uv run train.py but wraps it with REST API calls:
- `POST /api/experiment/setup` — init run tag
- `POST /api/experiment/register` — record hypothesis before training
- `POST /api/experiment/complete` — record metrics, auto keep/discard
- `POST /api/search/suggest` — get guidance on what to try next
- `POST /api/report/summary` — full stats for a run tag
Training runs for a fixed 5-minute time budget (wall clock). The metric is val_bpb (validation bits per byte) — lower is better, vocab-size-independent so architectural changes are fairly compared.
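As a sketch of why bits per byte is vocab-size-independent (this is the standard formula; check `prepare.py` for the repo's exact evaluation):

```python
import math

def val_bpb(total_loss_nats: float, n_bytes: int) -> float:
    """Bits per byte: total cross-entropy over the split (in nats),
    converted to bits, divided by the raw byte count of the text.
    Normalizing by bytes rather than tokens means tokenizers with
    different vocab sizes are directly comparable."""
    return total_loss_nats / math.log(2) / n_bytes

# e.g. mean loss of 2.0 nats/token over 4096 tokens covering 16384 bytes
print(round(val_bpb(2.0 * 4096, 16384), 4))  # -> 0.7213
```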
Requirements: NVIDIA GPU(s), Python 3.10+, uv, Rust 1.82+.
# 1. Install iii
curl -fsSL https://install.iii.dev/iii/main/install.sh | sh
# 2. Clone and install
git clone https://github.com/iii-hq/n-autoresearch.git
cd n-autoresearch
uv sync
# 3. Download data and train tokenizer (one-time)
uv run prepare.py
# 4. Start iii
iii --config iii-config.yaml
# 5. Start orchestrator (new terminal)
uv run python workers/orchestrator/orchestrator.py
# 6. Start GPU worker (new terminal, one per GPU)
cd workers/gpu
GPU_INDEX=0 REPO_DIR=/path/to/n-autoresearch cargo run --release
# For multiple GPUs:
GPU_INDEX=1 REPO_DIR=/path/to/n-autoresearch cargo run --release
# 7. Point your agent at program.md and go

Point Claude Code, Codex, or any agent at this repo and prompt:
Read program.md and kick off a new experiment.
The agent loop:
1. POST /api/experiment/setup — init run tag
2. POST /api/search/suggest — get search guidance
3. edit train.py — the experiment
4. git commit
5. POST /api/experiment/register — record hypothesis
6. uv run train.py > run.log 2>&1 — train (5 min)
7. POST /api/experiment/complete — record results
{ improved: true, action: "keep_commit" }
{ improved: false, action: "git_reset" }
8. repeat from 2
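The keep/discard decision in step 7 can be sketched as a pure function (illustrative only; the orchestrator's actual logic lives in `experiment::complete`):

```python
def decide(baseline_bpb: float, new_bpb: float) -> dict:
    """Mirror of the complete-endpoint response: keep the commit
    only if validation bits-per-byte strictly improved (lower)."""
    if new_bpb < baseline_bpb:
        return {"improved": True, "action": "keep_commit"}
    return {"improved": False, "action": "git_reset"}

print(decide(0.8300, 0.8213))  # improvement -> keep_commit
```

A tie counts as no improvement, so an experiment that merely matches the baseline gets reset rather than kept.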
If training crashes, POST /api/experiment/crash tracks consecutive failures and aborts after 3.
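The consecutive-failure tracking can be sketched as below (illustrative; the real counter lives in iii KV so it survives orchestrator restarts):

```python
class CrashGuard:
    """Abort after 3 consecutive crashes; any success resets the count."""
    LIMIT = 3

    def __init__(self) -> None:
        self.consecutive = 0

    def record_crash(self) -> bool:
        """Return True if the run should abort."""
        self.consecutive += 1
        return self.consecutive >= self.LIMIT

    def record_success(self) -> None:
        self.consecutive = 0

guard = CrashGuard()
print([guard.record_crash() for _ in range(3)])  # -> [False, False, True]
```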
N GPU workers = N parallel experiments. Each agent acquires a GPU, trains, records, releases. Search strategy adapts globally across all GPUs.
curl localhost:3111/api/pool/list
# { "total": 2, "idle": 0, "training": 2 }GPU workers on different machines can point to the same orchestrator:
# CPU machine (orchestrator)
iii --config iii-config.yaml
uv run python workers/orchestrator/orchestrator.py
# Each GPU machine
III_WS_URL=ws://<orchestrator-ip>:49134 GPU_INDEX=0 cargo run --release

Ports 49134 (WebSocket) and 3111 (REST) must be reachable from the GPU machines.
Strategy auto-adapts after each experiment based on recent history:
- explore (default) — broad random changes, try underexplored categories
- exploit — refine around best config, small incremental tweaks
- combine — merge two near-miss experiments that improved different aspects
- ablation — systematically remove components to find what matters
Transitions:
crash rate > 50% -> exploit (conservative)
plateau + near-misses available -> combine
plateau + no near-misses -> ablation
keep rate > 30% -> exploit
default -> explore
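The transition table above can be sketched as a pure function evaluated in table order (thresholds taken from the table; the real logic is in `search::adapt`):

```python
def adapt(crash_rate: float, keep_rate: float,
          plateau: bool, near_misses: int) -> str:
    """Pick the next search mode from recent-history stats,
    applying the transition rules in table order."""
    if crash_rate > 0.5:
        return "exploit"      # too many crashes: get conservative
    if plateau and near_misses > 0:
        return "combine"      # merge near-miss experiments
    if plateau:
        return "ablation"     # find out what actually matters
    if keep_rate > 0.3:
        return "exploit"      # a vein worth mining
    return "explore"          # default: broad random changes

print(adapt(crash_rate=0.1, keep_rate=0.4, plateau=False, near_misses=0))
# -> exploit
```

Rule order matters: a high crash rate forces exploit even when a plateau would otherwise trigger combine or ablation.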
prepare.py data prep + eval (do not modify)
train.py model + optimizer + loop (agent modifies)
program.md agent instructions
iii-config.yaml iii runtime config
workers/
orchestrator/
orchestrator.py Python worker — 22 functions, 22 triggers
gpu/ Rust worker (one per GPU)
src/
main.rs init, GPU detection, pool registration
config.rs env config
state.rs StateKV wrapper
functions/train.rs gpu::train + gpu::health
triggers/mod.rs HTTP + cron triggers
experiment::setup init tag + branch + strategy
experiment::register record hypothesis before training
experiment::complete record metrics, auto keep/discard, detect near-misses
experiment::crash track consecutive crashes, abort after 3
experiment::history query by tag/status/limit
experiment::best current best for a tag
experiment::near_misses experiments within 0.002 BPB of best
search::strategy get current mode (explore/exploit/combine/ablation)
search::set_strategy manual override
search::adapt auto-adapt from experiment history
search::suggest_direction category stats, underexplored areas, concrete suggestions
pool::register_gpu GPU worker self-registers on startup
pool::heartbeat 30s heartbeat, offline after 60s stale
pool::list all GPUs with status
pool::acquire atomic claim of idle GPU
pool::release return GPU to pool
pool::deregister remove on shutdown
report::summary full stats, BPB progression, category breakdown
report::tsv export in original autoresearch TSV format
report::diff compare two experiments
report::tags list all run tags
- Single file to modify. The agent only touches `train.py`. Diffs are reviewable, scope is manageable.
- Fixed time budget. Training always runs for exactly 5 minutes, so experiments are directly comparable regardless of what the agent changes, and autoresearch finds the best model your platform can train within that budget.
- Structured state. Experiments, lineage, GPU pool, search strategy all live in iii KV. Queryable, exportable, survives crashes.
- Multi-GPU native. N GPUs = N parallel experiments. Atomic GPU acquisition prevents conflicts. Strategy adapts globally.
- TSV compatibility. `report::tsv` exports in the original autoresearch format for backwards compatibility.
Tested on dual RTX 4090 (24GB each). For smaller GPUs:
- Lower `DEPTH` from 8 to 4
- Set `DEVICE_BATCH_SIZE` to 4 (RTX 4090) or 2 (smaller)
- Disable `torch.compile` if compilation OOMs
- Use `WINDOW_PATTERN = "L"` instead of `"SSSL"` if attention is slow
For H100 (80GB), the defaults work as-is.
Apache-2.0
