Forge Agent
Swarm Agents That Turn Slow PyTorch Into Fast CUDA/Triton Kernels
Abstract
Forge transforms PyTorch models into production-grade CUDA/Triton kernels through automated multi-agent optimization. Using 32 parallel AI agents with inference-time scaling, it achieves up to 14× faster inference than torch.compile(mode='max-autotune-no-cudagraphs') while maintaining 100% numerical correctness.
1. What is Forge?
Out-of-the-box PyTorch models suffer from suboptimal GPU utilization because they run on generic kernel implementations. Manual optimization requires deep CUDA expertise and weeks of engineering time per layer. Forge removes this bottleneck through automated kernel generation powered by competitive multi-agent search.
The system deploys 32 parallel Coder+Judge agent pairs that compete to discover optimal kernel implementations. Each agent explores different optimization strategies: tensor core utilization, memory coalescing, register blocking, shared memory tiling, and kernel fusion. Agents are powered by a fine-tuned NVIDIA Nemotron 3 Nano 30B model generating 250k tokens/second, enabling comprehensive search in minutes rather than hours.
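To make these strategies concrete, here is a minimal Triton sketch of the kind of kernel the agents search for (illustrative only, not actual Forge output): `tl.dot` engages the tensor cores, block tiling keeps global loads coalesced and lets Triton stage tiles through shared memory, each program instance register-blocks one output tile, and a ReLU epilogue is fused into the matmul so no second kernel launch is needed.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N output tile (register blocking).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Contiguous block loads -> coalesced global memory traffic;
        # Triton stages these tiles through shared memory for tl.dot.
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)              # tensor-core MMA, fp32 accumulation
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    acc = tl.maximum(acc, 0.0)           # fused ReLU epilogue: no second kernel launch
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16),
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul_relu(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Computes relu(a @ b) for fp16 CUDA tensors in a single kernel launch.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_relu_kernel[grid](a, b, c, M, N, K,
                             a.stride(0), a.stride(1),
                             b.stride(0), b.stride(1),
                             c.stride(0), c.stride(1),
                             BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

The block shapes (`BLOCK_M`/`BLOCK_N`/`BLOCK_K`), staging depth, and choice of fused epilogue are exactly the kind of parameters a competitive search would sweep.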
Forge targets the full optimization stack, from individual layer kernels to model-wide fusion patterns, and works with any PyTorch model.
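Stripped to its contract, the competitive Coder+Judge search can be pictured as follows. This is a schematic sketch with hypothetical callables (Forge's internals are not published): each agent proposes a kernel, the judge discards numerically incorrect candidates, and the fastest survivor wins.

```python
import concurrent.futures
from typing import Callable, Optional

def swarm_search(
    propose: Callable[[int], str],       # Coder: agent seed -> candidate kernel source
    is_correct: Callable[[str], bool],   # Judge: numerical check vs. the PyTorch reference
    latency_ms: Callable[[str], float],  # Judge: on-GPU timing of a correct candidate
    n_agents: int = 32,
) -> Optional[str]:
    """Run n_agents Coder+Judge pairs in parallel; keep the fastest correct kernel.
    The three callables are hypothetical stand-ins for Forge internals."""
    def run_pair(agent_id: int):
        candidate = propose(agent_id)    # each seed explores a different strategy
        if not is_correct(candidate):    # incorrect kernels are discarded, never ranked
            return None
        return (latency_ms(candidate), candidate)

    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        scored = [r for r in pool.map(run_pair, range(n_agents)) if r is not None]
    return min(scored, key=lambda r: r[0])[1] if scored else None
```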
4. Pricing
Agent Credits
Credit Refund if We Don't Beat torch.compile(mode='max-autotune-no-cudagraphs')
- ✓ Datacenter GPU Access (B200, H100, H200)
- ✓ Inference-Time Scaling with NVIDIA Nemotron 3 Nano 30B (250k tokens/sec)
- ✓ 32 Parallel Swarm Agents with Coder+Judge Pattern
- ✓ Advanced Kernel Database Retrieval
- ✓ Any PyTorch Model → All Layers Optimized
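The refund guarantee above is mechanically checkable. Here is a minimal sketch of such a comparison (generic PyTorch timing code, not Forge's harness; the toy MLP stands in for an arbitrary model): build the exact baseline named in the guarantee, require numerical agreement first, then compare CUDA-event timings.

```python
import torch
import torch.nn as nn

def mean_ms(fn, x, iters=100):
    for _ in range(10):          # warmup: exclude compilation/autotuning cost
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # mean milliseconds per call

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()
x = torch.randn(32, 1024, device="cuda")

# The exact baseline named in the guarantee.
baseline = torch.compile(model, mode="max-autotune-no-cudagraphs")

with torch.no_grad():
    # A candidate only counts if it matches the reference numerically...
    torch.testing.assert_close(baseline(x), model(x), rtol=1e-3, atol=1e-3)
    # ...and only then is the latency ratio a meaningful speedup.
    print(f"eager / baseline: {mean_ms(model, x) / mean_ms(baseline, x):.2f}x")
```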