🌙 EasyMuon

Muon optimizer made easy!

This is a single-file, plug-and-play implementation of the Muon optimizer designed for simplicity and stability. It combines Muon (for 2D hidden-layer weight matrices) and AdamW (for embeddings, biases, and other parameters) into a single, unified optimizer class.

What is Muon? It applies orthogonalized gradient updates, computed via Newton-Schulz iterations, to train deep models more efficiently than standard AdamW.
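
For intuition, here is a minimal sketch of the Newton-Schulz orthogonalization step. It follows the widely used quintic iteration from the public Muon reference implementation; the function name, coefficients, step count, and dtype choices are illustrative and may differ from what easy_muon.py actually does.

import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately replace G with the nearest (semi-)orthogonal matrix.
    # Coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + 1e-7)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

Inside the optimizer, this orthogonalized matrix (computed from the momentum buffer rather than the raw gradient) takes the place of the usual element-wise Adam-style update.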

✨ Features

  • 🚀 Drop-in Replacement: Just copy easy_muon.py into your project.
  • 🛡️ Automatic Fallback: Applies Muon to 2D matrices and AdamW to everything else automatically.
  • ⚡ Optimized: Uses fused PyTorch operations for speed.
  • 🤝 Distributed Ready: Works out-of-the-box with DDP and FSDP (see the sketch after this list).
  • 🌚 Moonlight Scaling: Implements the recommended scaling from Moonlight, allowing you to reuse standard AdamW hyperparameters.
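
Because the optimizer step only consumes local, already-averaged gradients, distributed usage looks the same as with any torch optimizer. A minimal DDP sketch, assuming the process group has already been initialized (e.g. via torchrun) and one GPU per process:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from easy_muon import Muon, build_muon_param_groups

model = model.cuda()
ddp_model = DDP(model)                      # DDP all-reduces gradients during backward
param_groups = build_muon_param_groups(ddp_model.module)
optimizer = Muon(param_groups, lr=3e-4, weight_decay=0.1, momentum=0.95)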

📦 Usage

1. Basic Setup (Recommended)

Just grab the file and let the helper function do the work:

from easy_muon import Muon, build_muon_param_groups

# 1. Automatically split params into Muon (matrices) and AdamW (biases/embeds)
param_groups = build_muon_param_groups(model)

# 2. Initialize optimizer
optimizer = Muon(
    param_groups,
    lr=3e-4,
    weight_decay=0.1,
    momentum=0.95
)

# 3. Train as usual!
optimizer.zero_grad()
loss.backward()
optimizer.step()
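
The actual splitting logic lives in easy_muon.py; the sketch below is only an assumption illustrating the typical rule (2D hidden-layer weight matrices go to Muon, everything else goes to AdamW), using a hypothetical helper name:

def split_params_by_rule(model):
    # Hypothetical illustration; not the code shipped in easy_muon.py.
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 2D weight matrices -> Muon; embeddings, biases, norms,
        # and other non-2D tensors -> AdamW.
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [
        {"params": muon_params, "use_muon": True},
        {"params": adamw_params, "use_muon": False},
    ]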

2. Advanced / Manual Configuration

You can fully customize which parameters go where:

optimizer = Muon([
    # Group 1: The heavy lifters (Muon)
    {
        'params': model.layers.parameters(), 
        'use_muon': True, 
        'lr': 3e-4, 
        'ns_steps': 5
    },
    # Group 2: The sensitive parts (AdamW)
    {
        'params': model.embeddings.parameters(), 
        'use_muon': False, 
        'lr': 1e-4, 
        'adamw_betas': (0.9, 0.999)
    }
])
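
To sanity-check a manual setup, you can inspect the resulting groups through the standard param_groups attribute that every torch optimizer exposes:

# Print how many tensors/parameters landed in each group.
for i, group in enumerate(optimizer.param_groups):
    n_params = sum(p.numel() for p in group["params"])
    print(f"group {i}: use_muon={group.get('use_muon')}, "
          f"tensors={len(group['params'])}, params={n_params}")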

⚙️ Scaling Modes

This implementation supports two scaling strategies for the Muon update:

  1. "moonlight" (Default & Recommended):
    • Balances updates to match AdamW's RMS.
    • Benefit: You don't need to tune learning rates from scratch; standard AdamW LRs usually work immediately (see the sketch after this list).
  2. "original":
    • The original scaling from the Muon paper/repo.
    • Note: Requires significantly higher learning rates (e.g., ~0.02) to be effective.
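
As a rough illustration of the "moonlight" mode: the Moonlight report suggests multiplying the orthogonalized update by about 0.2 * sqrt(max(out_dim, in_dim)) so that its RMS lands near a typical AdamW update. The constant and exact placement below are assumptions and may not match easy_muon.py:

import math
import torch

def moonlight_scale(update: torch.Tensor) -> torch.Tensor:
    # Assumed scaling: brings the orthogonalized update's RMS close to
    # AdamW's, so AdamW learning rates transfer without retuning.
    out_dim, in_dim = update.shape
    return update * (0.2 * math.sqrt(max(out_dim, in_dim)))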

📜 References
