Hi everyone!
I just published a full technical report where I reproduced MoonshotAI’s Distributed Muon optimizer, validated their communication efficiency claims, and profiled DP/TP configurations on a 4-GPU cluster.
The post includes:
• Full profiling of Muon (DP=2/TP=2) and Adam
• Perfetto traces (communication patterns)
• Memory analysis
• Two bug fixes to the open-source PoC
• Async-op experiments (and why naive overlap slows things down; quick sketch below)
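On that last point, here is roughly what I mean by “naive overlap”: a minimal sketch assuming PyTorch’s `torch.distributed` API and the gather → Newton-Schulz → re-shard structure, not the PoC’s actual code (`orthogonalize` is just a placeholder):

```python
# Minimal sketch, not the PoC's actual code: "naive" async overlap of the
# DP all-gather with the Muon update. Assumes an initialized torch.distributed
# process group; orthogonalize() stands in for the Newton-Schulz step.
import torch
import torch.distributed as dist


def orthogonalize(m: torch.Tensor) -> torch.Tensor:
    # Placeholder for Muon's Newton-Schulz orthogonalization.
    return m / (m.norm() + 1e-7)


def naive_overlap_step(grad_shards: list[torch.Tensor]) -> list[torch.Tensor]:
    world = dist.get_world_size()

    # 1) Kick off every all-gather asynchronously, hoping to overlap comm with compute.
    pending = []
    for shard in grad_shards:
        bufs = [torch.empty_like(shard) for _ in range(world)]
        work = dist.all_gather(bufs, shard, async_op=True)
        pending.append((work, bufs))

    # 2) But the very next op needs the *full* matrix, so each handle gets waited on
    #    almost immediately. Nothing is actually hidden, and the extra handles and
    #    stream synchronizations can make this slower than a plain blocking gather.
    updates = []
    for work, bufs in pending:
        work.wait()
        full_grad = torch.cat(bufs, dim=0)
        updates.append(orthogonalize(full_grad))
    return updates
```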
Key results:
• 0.57× communication compared to AdamW
• 1.1% optimizer overhead
• 50% less optimizer-state memory (back-of-the-envelope check below)
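The intuition behind the memory number is simple: AdamW keeps two FP32 moment buffers per parameter (exp_avg and exp_avg_sq), while Muon keeps a single momentum buffer. A quick back-of-the-envelope check (plain arithmetic, assuming FP32 states and ignoring master weights and the non-matrix params that typically stay on AdamW):

```python
# Back-of-the-envelope optimizer-state memory (FP32 states assumed).
BYTES_FP32 = 4

def adamw_state_bytes(n_params: int) -> int:
    # exp_avg + exp_avg_sq -> two FP32 buffers per parameter
    return 2 * n_params * BYTES_FP32

def muon_state_bytes(n_params: int) -> int:
    # a single momentum buffer per parameter
    return n_params * BYTES_FP32

n = 1_000_000_000  # e.g. a 1B-parameter model
print(f"AdamW state: {adamw_state_bytes(n) / 2**30:.2f} GiB")  # ~7.45 GiB
print(f"Muon  state: {muon_state_bytes(n) / 2**30:.2f} GiB")   # ~3.73 GiB
```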
Write-up here: Reproducing and Validating Distributed Muon: A Practical Verification of Communication Efficiency Claims
I’m preparing a cleaned-up repo next. If you are experimenting with Muon, distributed optimizers, or multi-node scaling, happy to collaborate or cross-validate results.
The optimizer overhead number made me grin because I have spent way too many nights fighting with bloated state updates. Seeing Muon behave this lean gives me a strange feeling of hope and mild jealousy at the same time.
Haha, ‘bloat’ is the enemy of us all!
While I haven’t battled 36B models personally (yet!), seeing the math play out was definitely a ‘hopeful’ moment.
I know deciding whether to use Muon for fine-tuning is tricky (especially if you use LoRA), but if you ever want to mess around with the distributed setup or see the raw traces, I just pushed the full reproducibility suite: bird-of-paradise/muon-distributed-reproducibility · Datasets at Hugging Face.
Good luck with that 36B model—that sounds like a beast to wrangle!