# Muon optimizer made easy!
This is a single-file, plug-and-play implementation of the Muon optimizer designed for simplicity and stability. It combines Muon (for the 2D weight matrices of hidden layers) and AdamW (for embeddings, biases, and everything else) into a single, unified optimizer class.
What is Muon? It orthogonalizes gradient updates via Newton-Schulz iterations, which lets it train deep models more efficiently than standard AdamW.
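For intuition, here is a minimal sketch of that orthogonalization step, assuming the quintic Newton-Schulz iteration and coefficients from Keller Jordan's reference Muon implementation (the function name is illustrative, not necessarily this file's API):

```python
import torch

def orthogonalize_newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that drives the singular values of G
    # toward 1, approximating U @ V^T from the SVD G = U S V^T.
    # Coefficients are the tuned values from the reference Muon repo.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT  # iterate on the wide orientation for stability
    # Scale so the spectral norm is <= 1, a convergence requirement.
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```

Muon applies this to the momentum buffer of each 2D weight matrix and uses the orthogonalized result, rather than the raw momentum, as the update direction.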
- 🚀 Drop-in Replacement: Just copy `easy_muon.py` into your project.
- 🛡️ Automatic Fallback: Applies Muon to 2D matrices and AdamW to everything else automatically (see the sketch after this list).
- ⚡ Optimized: Uses fused `torch` operations.
- 🤝 Distributed Ready: Works out of the box with DDP and FSDP.
- 🌚 Moonlight Scaling: Implements the recommended scaling from Moonlight, allowing you to reuse standard AdamW hyperparameters.
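The split heuristic behind that automatic fallback looks roughly like this (a minimal sketch assuming the common ndim-based convention; the real `build_muon_param_groups` may differ in details):

```python
import torch.nn as nn

def build_muon_param_groups_sketch(model: nn.Module):
    # Hypothetical sketch: route 2D+ weight matrices to Muon and everything
    # else (biases, norms, embeddings) to the AdamW fallback.
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Embeddings are matrices too, but are conventionally kept on AdamW.
        if p.ndim >= 2 and 'embed' not in name.lower():
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [
        {'params': muon_params, 'use_muon': True},
        {'params': adamw_params, 'use_muon': False},
    ]
```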
Just grab the file and let the helper function do the work:
```python
from easy_muon import Muon, build_muon_param_groups

# 1. Automatically split params into Muon (matrices) and AdamW (biases/embeds)
param_groups = build_muon_param_groups(model)

# 2. Initialize the optimizer
optimizer = Muon(
    param_groups,
    lr=3e-4,
    weight_decay=0.1,
    momentum=0.95,
)

# 3. Train as usual!
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

You can fully customize which parameters go where:
```python
optimizer = Muon([
    # Group 1: The heavy lifters (Muon)
    {
        'params': model.layers.parameters(),
        'use_muon': True,
        'lr': 3e-4,
        'ns_steps': 5,
    },
    # Group 2: The sensitive parts (AdamW)
    {
        'params': model.embeddings.parameters(),
        'use_muon': False,
        'lr': 1e-4,
        'adamw_betas': (0.9, 0.999),
    },
])
```

This implementation supports two scaling strategies for the Muon update:
"moonlight"(Default & Recommended):- Balances updates to match AdamW's RMS.
- Benefit: You don't need to tune learning rates from scratch; standard AdamW LRs usually work immediately.
"original":- The original scaling from the Muon paper/repo.
- Note: Requires significantly higher learning rates (e.g., ~0.02) to be effective.
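Concretely, the Moonlight rule rescales each matrix update so its RMS lands near AdamW's typical step size. A common formulation, shown here as an illustrative sketch rather than this file's exact code, multiplies the learning rate by a shape-dependent factor:

```python
import math
import torch

def moonlight_adjusted_lr(lr: float, weight: torch.Tensor) -> float:
    # Sketch of Moonlight's RMS-matching scale: multiply the base LR by
    # 0.2 * sqrt(max(fan_out, fan_in)) so the orthogonalized update has an
    # RMS comparable to a typical AdamW step (~0.2).
    fan_out, fan_in = weight.shape[-2], weight.shape[-1]
    return lr * 0.2 * math.sqrt(max(fan_out, fan_in))
```

With that factor applied, a base LR tuned for AdamW (e.g., 3e-4) transfers to Muon without retuning, which is why `"moonlight"` is the default.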
Credits:

- Muon (Keller Jordan): The original implementation.
- Moonlight: Source of the improved scaling logic.