Hi everyone!
I just published a full technical report where I reproduced MoonshotAI’s Distributed Muon optimizer, validated their communication efficiency claims, and profiled DP/TP configurations on a 4-GPU cluster.
The post includes:
• Full profiling of Muon (DP=2/TP=2) and Adam
• Perfetto traces (communication patterns)
• Memory analysis
• Two bug fixes to the open-source PoC
• Async-op experiments (and why naive overlap slows things down; quick sketch below)
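On that last point, here is roughly what I mean by “naive overlap”: a minimal sketch assuming PyTorch’s `torch.distributed` API and the gather → Newton-Schulz → re-shard structure, not the PoC’s actual code (`orthogonalize` is just a placeholder):

```python
# Minimal sketch, not the PoC's actual code: "naive" async overlap of the
# DP all-gather with the Muon update. Assumes an initialized torch.distributed
# process group; orthogonalize() stands in for the Newton-Schulz step.
import torch
import torch.distributed as dist


def orthogonalize(m: torch.Tensor) -> torch.Tensor:
    # Placeholder for Muon's Newton-Schulz orthogonalization.
    return m / (m.norm() + 1e-7)


def naive_overlap_step(grad_shards: list[torch.Tensor]) -> list[torch.Tensor]:
    world = dist.get_world_size()

    # 1) Kick off every all-gather asynchronously, hoping to overlap comm with compute.
    pending = []
    for shard in grad_shards:
        bufs = [torch.empty_like(shard) for _ in range(world)]
        work = dist.all_gather(bufs, shard, async_op=True)
        pending.append((work, bufs))

    # 2) But the very next op needs the *full* matrix, so each handle gets waited on
    #    almost immediately. Nothing is actually hidden, and the extra handles and
    #    stream synchronizations can make this slower than a plain blocking gather.
    updates = []
    for work, bufs in pending:
        work.wait()
        full_grad = torch.cat(bufs, dim=0)
        updates.append(orthogonalize(full_grad))
    return updates
```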
Key results:
• 0.57× communication compared to AdamW
• 1.1% optimizer overhead
• 50% less optimizer-state memory (back-of-the-envelope check below)
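The intuition behind the memory number is simple: AdamW keeps two FP32 moment buffers per parameter (exp_avg and exp_avg_sq), while Muon keeps a single momentum buffer. A quick back-of-the-envelope check (plain arithmetic, assuming FP32 states and ignoring master weights and the non-matrix params that typically stay on AdamW):

```python
# Back-of-the-envelope optimizer-state memory (FP32 states assumed).
BYTES_FP32 = 4

def adamw_state_bytes(n_params: int) -> int:
    # exp_avg + exp_avg_sq -> two FP32 buffers per parameter
    return 2 * n_params * BYTES_FP32

def muon_state_bytes(n_params: int) -> int:
    # a single momentum buffer per parameter
    return n_params * BYTES_FP32

n = 1_000_000_000  # e.g. a 1B-parameter model
print(f"AdamW state: {adamw_state_bytes(n) / 2**30:.2f} GiB")  # ~7.45 GiB
print(f"Muon  state: {muon_state_bytes(n) / 2**30:.2f} GiB")   # ~3.73 GiB
```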
Write-up here: Reproducing and Validating Distributed Muon: A Practical Verification of Communication Efficiency Claims
I’m preparing a cleaned-up repo next. If you are experimenting with Muon, distributed optimizers, or multi-node scaling, happy to collaborate or cross-validate results.
The optimizer overhead number made me grin because I have spent way too many nights fighting with bloated state updates. Seeing Muon behave this lean gives me a strange feeling of hope and mild jealousy at the same time.
Haha, ‘bloat’ is the enemy of us all!
While I haven’t battled 36B models personally (yet!), seeing the math play out was definitely a ‘hopeful’ moment.
I know deciding whether to use Muon for fine-tuning is tricky (especially if you use LoRA), but if you ever want to mess around with the distributed setup or see the raw traces, I just pushed the full reproducibility suite: bird-of-paradise/muon-distributed-reproducibility · Datasets at Hugging Face.
Good luck with that 36B model—that sounds like a beast to wrangle!