Research and systems substrate for artifact-first training and evaluation of LLM-powered code agents on GPU programming tasks.
The repository centers on one idea:
GPU-agent work gets much easier to reason about when traces, benchmarks, build artifacts, profile summaries, candidate transitions, and training-readiness decisions are all first-class objects instead of side effects hidden in ad hoc scripts.
This repo is the resulting cockpit:
- a local execution and evaluation surface for `run`, `bench`, `eval`, `build`, `inspect`, `compare`, `replay`, and `bundle`
- a task and benchmark substrate spanning internal Triton/CUDA-style workloads and curated public benchmark imports
- a transition-aware collection layer for trajectories, patch-bearing repair traces, reformulation episodes, and SFT packaging
> [!IMPORTANT]
> This repository is not a one-command dense-training stack. Its primary job is to make GPU-agent tasks, artifacts, datasets, and transition-rich evaluation state explicit enough that later SFT and narrow RL runs can start from a clean, reproducible substrate.
Table of contents
| Category | Description |
|---|---|
| Core package | gpu_cockpit/ contains contracts, execution engines, backend adapters, CLI entrypoints, and collection logic. |
| Workload substrate | workloads/ contains internal task specs, baselines, public benchmark imports, reference implementations, fixtures, and evaluation hooks. |
| Golden verification surface | tests/ contains regression tests plus checked-in golden bundles, datasets, retrieval fixtures, and multistep episode fixtures. |
| Knowledge and retrieval | knowledge/ contains operator docs, profiler playbooks, transformation cards, benchmark notes, and hardware notes. |
| Training configs | configs/training/ contains smoke SFT and rollout configs for the first bounded tool-use target. |
| It is | It is not |
|---|---|
| A GPU-agent cockpit and data factory | A generic LLM chat app |
| A benchmark and eval normalization layer | A leaderboard-only benchmark wrapper |
| A transition-aware trace and SFT substrate | A finished RL training stack |
| Training-facing configs and smoke scripts | A hardware-specific transfer or bootstrap playbook |
| Area | Status | Notes |
|---|---|---|
| Contracts and schemas | 🟢 | Versioned contracts for runs, traces, profiles, replay packs, trajectories, SFT, rollout configs, patches, and candidate lineage |
| NVIDIA workflow | 🟢 | nsys, normalized profiling surfaces, sanitizer normalization, tri-view build artifacts, bottleneck summaries |
| AMD mirror | 🟡 | Narrow mirrored trace/profile path and golden fixtures; intentionally not broad parity yet |
| Internal task verbs | 🟢 | diagnose, debug, reformulate, and optimize are all represented |
| Public benchmark imports | 🟢 | Curated KernelBench and ComputeEval adapters |
| Retrieval and knowledge | 🟢 | Docs, run examples, and patch-bearing episodes are queryable through a local index |
| Training packaging | 🟢 | Transition-rich trajectory export, SFT packaging, rollout smoke evaluation, and checked-in smoke configs |
| Full training execution | 🔵 | Intentionally separated from the default local workflow and gated behind smoke validation |
> [!NOTE]
> The strongest path forward is not broadening every surface equally. The current priority is high-quality repair, diagnose, and reformulate traces that survive inspection, replay, and packaging cleanly.
| Surface | Command family | Output |
|---|---|---|
| Environment and hardware inspection | `gpc doctor` | Toolchain availability, hardware fingerprint, vendor details |
| Registry and benchmark inventory | `gpc task ...`, `gpc adapter ...` | Task listings, adapter summaries, curated case metadata |
| Run and build capture | `gpc run`, `gpc build` | Run bundle, command summary, tri-view artifacts, system traces, profiles |
| Task evaluation | `gpc eval` | Correctness, determinism, anti-hack, perf, and gate summary artifacts |
| Bundle analysis | `gpc inspect`, `gpc compare`, `gpc replay`, `gpc bundle` | Quality projections, lineage, proof bundles, replay validation |
| Agent environment | `gpc env action-space`, `gpc env scripted` | Compact observations and scripted multistep episodes |
| Offline data export | `gpc trajectory ...`, `gpc sft ...` | Trajectory datasets and packaged SFT corpora |
| Knowledge and retrieval | `gpc knowledge ...` | Mixed docs-plus-examples lookup |
| Training scaffolding | `gpc train ...`, `gpc rollout ...` | Config validation, smoke reports, held-out scripted baselines |
- Artifact-first: every serious step emits inspectable bundle state rather than ad hoc console noise
- Transition-aware: episodes capture candidate lineage, patch hashes, patch kinds, and repair/reformulate transitions
- Governed packaging: run-level readiness and episode-level readiness are separated on purpose
- Task-rich: internal Triton/CUDA-style tasks coexist with curated public benchmark adapters
- Training-targeted: the current data and config surface is oriented toward bounded tool-use on a strong sub-40B model, not open-ended frontier RL
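The transition-aware principle can be made concrete with a small sketch. This is illustrative only: the real versioned contracts live under `gpu_cockpit/contracts/`, and every name here (`CandidateTransition`, `parent_candidate_id`, `patch_hash`, and so on) is a hypothetical stand-in, not the repository's actual schema.

```python
from dataclasses import dataclass
from hashlib import sha256

# Illustrative only: the real versioned contracts live in gpu_cockpit/contracts/.
# All field names here are hypothetical stand-ins.
@dataclass(frozen=True)
class CandidateTransition:
    parent_candidate_id: str  # lineage: which candidate the patch was applied to
    child_candidate_id: str   # the candidate the patch produced
    transition_kind: str      # e.g. "repaired", "reformulated", "patch_applied"
    patch_hash: str           # stable content hash of the unified diff

def make_transition(parent: str, child: str, kind: str, diff: str) -> CandidateTransition:
    """Record a patch-bearing transition with a content-addressed patch hash."""
    return CandidateTransition(parent, child, kind, sha256(diff.encode()).hexdigest())

t = make_transition("cand_0", "cand_1", "repaired", "--- a/kernel.py\n+++ b/kernel.py\n")
```

Hashing the diff rather than storing it inline is one way to keep lineage records small while still making any patch content-addressable from the bundle.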
```
gpu_rl/
├── README.md # top-level project overview, usage guide, and architecture map
├── pyproject.toml # package metadata and the `gpc` console entrypoint
├── .gitignore # runtime, dataset, and planning-artifact exclusions
├── configs/
│   └── training/
│       ├── first_target_splits_v1.json # frozen train/dev split definition for the first training target
│       ├── rollout_debug_repair_heldout_v1.json
│       ├── rollout_debug_repair_v1.json # scripted rollout configs for local and held-out evaluation
│       └── sft_qwen32b_debug_repair_lora.json # first checked-in SFT smoke config for the initial training target
├── docs/
├── gpu_cockpit/
│   ├── cli/
│   │   └── main.py # public `gpc` CLI surface
│   ├── contracts/
│   │   ├── compare.py
│   │   ├── environment.py
│   │   ├── evidence.py
│   │   ├── patch.py
│   │   ├── replay.py
│   │   ├── summary.py
│   │   ├── training.py
│   │   └── trajectory.py # schema-first contract layer for bundles, episodes, and training configs
│   ├── engine/
│   │   ├── benchmark.py
│   │   ├── environment.py
│   │   ├── evaluator.py
│   │   ├── evidence.py
│   │   ├── inspector.py
│   │   ├── knowledge.py
│   │   ├── patching.py
│   │   ├── replay.py
│   │   ├── rollout.py
│   │   ├── runner.py
│   │   ├── sft.py
│   │   ├── training.py
│   │   └── trajectory.py # execution, eval, inspection, retrieval, and data-packaging engines
│   ├── backends/
│   │   ├── amd/ # ROCm trace/profile normalization and AMD mirrored-path logic
│   │   └── nvidia/ # Nsight, sanitizer, disassembly, and profile normalization
│   ├── executors/ # local host and docker execution backends
│   ├── workloads/ # adapter registration and workload-facing package helpers
│   └── artifacts/
│       └── schemas/ # exported JSON schemas for the contract layer
├── knowledge/
│   ├── README.md
│   ├── benchmark_notes/
│   ├── hardware_notes/
│   ├── operator_families/
│   ├── profiler_playbooks/
│   └── transformation_cards/ # human-written retrieval corpus for operators, bottlenecks, and transforms
├── scripts/
│   ├── build_heldout_baseline_report.py
│   ├── build_first_target_training_assets.py
│   ├── export_schemas.py
│   ├── generate_transition_goldens.py
│   ├── smoke_rollout_eval.py
│   └── smoke_sft_train.py # reproducible builders and smoke paths for schemas, datasets, and training assets
├── tests/
│   ├── golden_datasets/
│   ├── golden_episodes/
│   ├── golden_retrieval/
│   ├── golden_runs/
│   ├── test_environment.py
│   ├── test_evaluator.py
│   ├── test_inspector.py
│   ├── test_knowledge.py
│   ├── test_sft.py
│   ├── test_training.py
│   └── test_trajectory.py # regression suite plus checked-in golden bundles and training-facing fixtures
└── workloads/
    ├── baselines/
    ├── benchmarks/
    ├── fixtures/
    ├── public_benchmarks/
    ├── reference/
    ├── tasks/
    └── tests/ # task specs, baselines, curated imports, reference kernels, and hook scripts
```
These documents freeze the semantics and boundaries that matter for the first training wave:
| Document | Purpose |
|---|---|
| `docs/PROJECT_SCOPE.md` | Defines the finished local environment/data scope, deferred training execution work, and explicit non-goals |
| `docs/GLOSSARY.md` | Freezes the shared vocabulary across runs, episodes, governance, replay, and training-facing data |
| `docs/FIRST_WAVE_TASKS.md` | Inventory of the first-wave training task families and why each is in scope |
| `docs/BENCHMARK_POLICY.md` | Policy for how public benchmark traces participate in packaging, reporting, and training |
| `docs/AMD_SCOPE.md` | Explicit narrow-scope AMD mirrored-path boundary for the current program |
| `docs/OBSERVABILITY_SURFACE.md` | Frozen local scope for build, trace, profile, sanitizer, and bottleneck artifacts |
| `docs/REPLAY_COMPARE.md` | Replay, compare, and proof-bundle semantics for transition-aware review and packaging |
| `docs/DATA_GOVERNANCE.md` | Run-level readiness versus episode-level governance and training-example semantics |
| `docs/POLICY_INTERFACE.md` | First-wave action surface, observation model, and rollout semantics |
| `docs/RETRIEVAL_GUIDE.md` | Retrieval corpus structure and recommended query patterns |
| `docs/REMOTE_SANDBOX_ABSTRACTION.md` | Neutral remote-session contract, sync policy, and artifact transfer boundary |
For schema work, bundle inspection, knowledge indexing, and non-GPU smoke paths:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

| Goal | Requirements |
|---|---|
| Basic package and CLI use | Python 3.12+, pip, editable install |
| Triton/CUDA task execution | torch, triton, working CUDA runtime |
| Rich NVIDIA observability | nsys, ncu, compute-sanitizer, optional nvcc |
| Narrow AMD parity checks | ROCm runtime plus rocprof / rocprofv3, rocminfo, rocm-smi, hipcc |
| End-to-end smoke path | The above plus enough GPU support for the reference Triton tasks |
> [!TIP]
> The repository is intentionally usable in partial environments. Missing profiler or vendor tools degrade the corresponding run surfaces instead of collapsing the whole CLI.
| Surface | NVIDIA | AMD | CPU-only |
|---|---|---|---|
| Contracts / schemas / inspection | 🟢 | 🟢 | 🟢 |
| Knowledge index and retrieval | 🟢 | 🟢 | 🟢 |
| Trajectory export / SFT packaging | 🟢 | 🟢 | 🟢 |
| Triton internal tasks | 🟢 | 🟡 | 🔴 |
| Public CUDA benchmark adapters | 🟢 | 🔴 | 🔴 |
| `nsys` / `ncu` / sanitizer-backed runs | 🟢 | 🔴 | 🔴 |
| ROCm mirrored profile path | 🔴 | 🟡 | 🔴 |
| Local smoke training scaffolding | 🟢 | 🟢 | 🟢 |
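A minimal sketch of the degradation idea: probe for the vendor binaries and disable the surfaces they back rather than failing the whole CLI. The surface names and the surface-to-binary mapping below are hypothetical, not the cockpit's actual probe logic (only `nsys`, `ncu`, and `compute-sanitizer` are real tool names from the matrix above).

```python
import shutil

# Hypothetical sketch of graceful degradation: probe for vendor binaries and
# mark the surfaces they back as unavailable instead of failing the whole CLI.
# The surface-to-binary mapping is illustrative.
PROFILER_TOOLS = {
    "system_trace": "nsys",
    "kernel_profile": "ncu",
    "sanitizer": "compute-sanitizer",
}

def probe_surfaces(tools: dict[str, str]) -> dict[str, bool]:
    """Return availability per surface based on which binaries are on PATH."""
    return {surface: shutil.which(binary) is not None for surface, binary in tools.items()}

availability = probe_surfaces(PROFILER_TOOLS)
degraded = sorted(s for s, ok in availability.items() if not ok)  # disable, don't crash
```

A real `gpc doctor` report would presumably fold this kind of probe into the hardware fingerprint; the point of the sketch is only that absence of a tool yields a degraded surface list, not an exception.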
```bash
gpc doctor
```

```bash
gpc task list
gpc adapter list
gpc adapter summary kernelbench
gpc adapter summary computeeval
```

```bash
gpc eval \
  --task task/reduction_debug/eval/v1 \
  --determinism-runs 2 \
  -- python3 workloads/reference/triton_row_sum_patchable_candidate.py --benchmark-repeats 5
```

```bash
gpc inspect runs/<run_id> --section quality
gpc compare runs/<run_a> runs/<run_b>
gpc replay runs/<run_id>
gpc bundle runs/<run_id> --full
```

```bash
python3 scripts/build_first_target_training_assets.py
python3 scripts/build_heldout_baseline_report.py
python3 scripts/smoke_sft_train.py
python3 scripts/smoke_rollout_eval.py
```

| Family | Subcommands / flags worth knowing |
|---|---|
| `doctor` | Local hardware and toolchain discovery |
| `task`, `adapter` | Task registry, adapter inventory, benchmark case summaries |
| `build`, `run`, `eval`, `bench` | Build/disassembly, execution, evaluation, benchmarking |
| `inspect`, `compare`, `replay`, `bundle`, `runs` | Bundle analysis, lineage inspection, proof export, index queries |
| `trajectory`, `env`, `sft` | Episode generation, bounded environment helpers, SFT packaging |
| `knowledge` | Build local index, free-text query, retrieve similar tasks |
| `train`, `rollout` | Config validation, smoke SFT reports, scripted rollout suites |
```bash
gpc build \
  --task task/attention_score/eval/v1 \
  --triton-build-spec workloads/reference/triton_attention_score_kernel.py:get_build_spec \
  --source-file workloads/reference/triton_attention_score_kernel.py
```

Useful when the immediate goal is inspecting:
- generated PTX
- SASS or disassembly output
- Triton IR stages
- source-to-PTX/SASS mapping summaries
```bash
gpc run \
  --task task/attention_score/eval/v1 \
  --trace-system \
  --profile-kernel \
  --profile-pack quick \
  --sanitize \
  --sanitize-tool memcheck \
  --emit-disassembly \
  --triton-build-spec workloads/reference/triton_attention_score_kernel.py:get_build_spec \
  -- python3 workloads/reference/triton_attention_score_optimize_candidate.py --benchmark-repeats 5
```

```bash
gpc env scripted \
  --task task/reduction_debug/eval/v1 \
  --out /tmp/reduction_debug_episode.json \
  --step-budget 12 \
  --workflow debug \
  --with-build \
  --triton-build-spec workloads/reference/triton_row_sum_kernel.py:get_build_spec \
  -- python3 workloads/reference/triton_row_sum_patchable_candidate.py --benchmark-repeats 5
```

```bash
gpc sft package \
  datasets/first_target_transition_train_v1 \
  --out-dir datasets/first_target_sft_train_v1 \
  --split train \
  --patch-bearing-only \
  --governance usable_positive_sft \
  --governance usable_negative_debug \
  --governance usable_negative_transition \
  --transition-kind repaired \
  --transition-kind reformulated \
  --transition-kind patch_applied \
  --verb debug \
  --verb reformulate
```

```bash
gpc knowledge build-index
gpc knowledge query \
  --query "mask bug failed repair register pressure" \
  --verb debug \
  --prefer-mixed \
  --limit 8
```

Selected CLI idioms
```bash
gpc inspect runs/<run_id> --section build
gpc inspect runs/<run_id> --section profile
gpc inspect runs/<run_id> --section replay
gpc inspect runs/<run_id> --section quality
```

```bash
gpc bench --adapter kernelbench --case case/kernelbench/level1/40_layernorm/v0_1
gpc bench --adapter computeeval --case case/computeeval/2025_1/cuda_16/v1
```

```bash
gpc train validate-config configs/training/sft_qwen32b_debug_repair_lora.json
gpc rollout scripted configs/training/rollout_debug_repair_heldout_v1.json --out-dir /tmp/heldout_rollout
```

| Task family | Verb | Backend | What it exercises |
|---|---|---|---|
| `reduction_sum` | optimize | Triton | Row-wise reduction kernels and correctness/perf contracts |
| `reduction_debug` | debug | Triton | Broken masking/repair-oriented traces with patch-bearing fixes |
| `routing_argmax` | optimize | Triton | Routing/indexing kernels |
| `topk_router` | optimize | Triton | Routing top-k behavior and benchmarkable operator logic |
| `attention_score` | optimize | Triton | Tiled causal attention-score kernels |
| `attention_reformulate` | reformulate | Triton | Weak-vs-optimized strategy transitions |
| `kv_cache_gather` | optimize | Triton | KV-cache gather behavior and attention-adjacent memory access patterns |
| `profile_diagnose` | diagnose | CUDA | Bottleneck interpretation and profiler-conditioned analysis |
| `smoke` / `smoke_eval` | diagnose | CUDA / Triton | Minimal substrate checks |
| `amd_smoke` | diagnose | HIP | Narrow AMD mirrored-path validation |
| Adapter | Curated cases | Source | Notes |
|---|---|---|---|
| `kernelbench` | 11 | KernelBench | Activation, normalization, reduction, indexing, matmul, attention-adjacent coverage |
| `computeeval` | 8 | ComputeEval | CUDA kernel launch, streams, reductions, CUB, Thrust, and metadata-heavy variants |
- Internal tasks carry richer semantics for repair, reformulation, hidden failures, patch transitions, and training governance.
- Public adapters provide external provenance, broader operator coverage, and a reality check against overfitting to bespoke tasks.
- The packaging defaults deliberately prefer transition-rich internal traces over thin public benchmark wrappers unless explicitly configured otherwise.
Every serious action in the cockpit writes a run directory that can be inspected, replayed, compared, or packaged.
```
runs/<run_id>/
├── manifest.json
├── events.jsonl
├── summary.json
├── summary.md
├── prompt/
│   └── task_spec.json
├── meta/
│   ├── doctor_report.json
│   ├── hardware_fingerprint.json
│   └── task_spec_full.json
├── command/
│   ├── stdout.txt
│   ├── stderr.txt
│   └── summary.json
├── correctness/
│   ├── correctness.json
│   ├── determinism.json
│   └── *_summary.json
├── eval/
│   ├── anti_hack_report.json
│   ├── eval_envelope.json
│   └── gate_summary.json
├── perf/
│   ├── raw_timings.json
│   └── benchmark.json
├── build/
│   ├── build_record.json
│   ├── source_map_summary.json
│   ├── tri_view.json
│   └── source_ptx_sass_map.json
├── patches/
│   ├── request.json
│   ├── unified_diff.patch
│   └── applied_patch.json
├── candidate/
│   ├── state.json
│   └── transition.json
└── replay/
    ├── command.json
    ├── environment.json
    └── replay_pack.json
```
| Family | Why it matters |
|---|---|
| `summary.json` / `summary.md` | Fast human and programmatic overview of run state |
| `eval/eval_envelope.json` | Core pass/fail and reward-bearing evaluation gates |
| `perf/benchmark.json` | Perf gate inputs and baseline comparisons |
| `build/*` | Triton IR, PTX, SASS, source map summaries, tri-view artifacts |
| `patches/*` and `candidate/*` | Candidate lineage and transition-aware training traces |
| `replay/*` | Rehydration metadata for reproducibility and proof bundles |
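As a sketch of why this layout pays off, programmatic triage of a bundle reduces to reading a couple of JSON files. This is not the cockpit's actual loader: only the `eval/gate_summary.json` path comes from the layout above, and the gate keys used here are hypothetical.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Sketch, not the cockpit's loader: the bundle layout makes triage a couple of
# JSON reads. Only the eval/gate_summary.json path comes from the documented
# layout; the gate keys are hypothetical.
def failed_gates(run_dir: Path) -> list[str]:
    gates = json.loads((run_dir / "eval" / "gate_summary.json").read_text())
    return sorted(name for name, passed in gates.items() if not passed)

with TemporaryDirectory() as tmp:  # stand-in for a real runs/<run_id> directory
    run_dir = Path(tmp) / "run_0001"
    (run_dir / "eval").mkdir(parents=True)
    (run_dir / "eval" / "gate_summary.json").write_text(
        json.dumps({"correctness": True, "determinism": True, "perf": False})
    )
    failed = failed_gates(run_dir)
```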
The local knowledge base is intentionally small and structured rather than sprawling.
| Subdirectory | Content |
|---|---|
| `knowledge/operator_families/` | Operator-specific notes such as reduction, attention score, KV-cache gather, and profile diagnosis |
| `knowledge/profiler_playbooks/` | Bottleneck-oriented interpretive guides such as memory-bound or occupancy-limited cases |
| `knowledge/transformation_cards/` | Tactical strategy cards such as tiling, vectorization, staging, masking, and layout changes |
| `knowledge/benchmark_notes/` | Curated notes about imported public benchmarks |
| `knowledge/hardware_notes/` | Vendor- and platform-specific constraints such as AMD parity scope |
Retrieval is designed to return a mix of:
- docs
- prior run examples
- patch-bearing repair traces
- reformulation examples
- similar tasks
That bias is deliberate: useful training and debugging context usually mixes prose with concrete examples.
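One way to realize that bias, shown purely as a sketch (this is not the cockpit's ranking code): round-robin across result kinds by descending score, so documents and concrete examples interleave instead of one kind crowding out the rest. The hit format and kind names are invented for illustration.

```python
from itertools import zip_longest

# Sketch only: not the cockpit's ranking code. Interleave result kinds so
# docs, run examples, and repair traces all surface in one query.
def mixed_results(hits: list[tuple[str, str, float]], limit: int) -> list[str]:
    """hits are (kind, item_id, score) triples; higher score is better."""
    by_kind: dict[str, list[str]] = {}
    for kind, item, _score in sorted(hits, key=lambda h: -h[2]):
        by_kind.setdefault(kind, []).append(item)
    out: list[str] = []
    for row in zip_longest(*by_kind.values()):  # one item per kind per round
        out.extend(item for item in row if item is not None)
    return out[:limit]

hits = [("doc", "d1", 0.9), ("doc", "d2", 0.8), ("patch_episode", "e1", 0.7)]
top = mixed_results(hits, 2)  # mixes kinds: ["d1", "e1"], not ["d1", "d2"]
```

A pure top-k by score would return two docs here; the interleaving is what keeps a patch-bearing example in the context window.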
A large fraction of GPU-agent signal is not “one prompt, one answer.” The important learning unit is often:
- inspect a candidate
- identify a failure mode or weak strategy
- patch or transform the candidate
- rebuild, benchmark, compare, and re-evaluate
- decide whether the resulting trace is usable for training
The repository therefore treats the following as first-class:
- `patch_candidate` transitions
- candidate lineage and parent-child state
- episode-level governance separate from run-level readiness
- usable negative traces for failed repair and failed reformulation
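A toy sketch of how episode-level governance can differ from run-level readiness. The decision logic is invented for illustration; the label strings merely echo the governance flags used elsewhere in this README, and the real semantics live in `docs/DATA_GOVERNANCE.md`.

```python
# Toy decision logic, invented for illustration; real semantics live in
# docs/DATA_GOVERNANCE.md. Label strings echo the `gpc sft package` flags.
def governance_label(runs_ready: bool, repair_succeeded: bool, patch_bearing: bool) -> str:
    if not runs_ready:
        return "unusable"                 # run-level readiness gates everything
    if patch_bearing and repair_succeeded:
        return "usable_positive_sft"      # clean repair with a concrete patch
    if patch_bearing:
        return "usable_negative_debug"    # failed repair: still a training signal
    return "unusable"                     # no patch, nothing transition-bearing

labels = [
    governance_label(True, True, True),   # successful patch-bearing repair
    governance_label(True, False, True),  # usable negative trace
    governance_label(False, True, True),  # broken run bundle poisons the episode
]
```

The point of the separation: a failed repair can still be a usable *negative* training example, but only if every run inside the episode is individually sound.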
| Fixture class | Location |
|---|---|
| Transition datasets | `tests/golden_datasets/transition_collection_v1` |
| Negative transition datasets | `tests/golden_datasets/transition_negative_collection_v1` |
| Packaged SFT examples | `tests/golden_datasets/transition_sft_v1` |
| Negative packaged SFT examples | `tests/golden_datasets/transition_negative_sft_v1` |
| Episode fixtures | `tests/golden_episodes/` |
| Run-bundle fixtures | `tests/golden_runs/` |
```bash
pip install -e .
python3 scripts/export_schemas.py
python3 -m unittest discover -s tests -v
```

```bash
# Rebuild checked-in transition fixtures
python3 scripts/generate_transition_goldens.py

# Rebuild train/dev assets for the first training target
python3 scripts/build_first_target_training_assets.py

# Validate training config and rollout config
gpc train validate-config configs/training/sft_qwen32b_debug_repair_lora.json
gpc rollout scripted configs/training/rollout_debug_repair_heldout_v1.json --out-dir /tmp/heldout_rollout
```

- Start with `gpu_cockpit/cli/main.py` to see the public command surface.
- Read `gpu_cockpit/engine/runner.py`, `gpu_cockpit/engine/evaluator.py`, and `gpu_cockpit/engine/inspector.py` for the core run lifecycle.
- Read `gpu_cockpit/engine/environment.py`, `gpu_cockpit/engine/trajectory.py`, and `gpu_cockpit/engine/sft.py` for the training-facing data model.
- Read the training configs in `configs/training/` before changing training assumptions.
Checked in:

- source code
- tests
- schemas
- benchmark metadata
- golden runs / episodes / datasets used for regression coverage
- training configs
Not checked in:

- `runs/`
- `datasets/`
- `docs_tmp/`
- `artifacts/`
- generated `knowledge/index/`
This split is deliberate:
- checked-in goldens define the stable verification surface
- generated runtime outputs remain local and disposable
If you are here for the shortest orientation path:
- inspect the checked-in training configs
- run `gpc doctor`
- inspect `gpc --help`
- execute one `gpc eval` task
- inspect a run bundle
- build the training assets
If you are here to extend the project:
- prefer richer internal repair and reformulate traces over shallow benchmark breadth
- preserve candidate lineage and governance semantics
- treat patch-bearing episodes as the highest-signal assets in the current system
If you are here to start training:
- finish validation on the checked-in smoke path
- use the checked-in training configs and smoke scripts
- validate on the intended training environment before launching larger jobs
- treat the smoke sequence as the gate to more expensive runs