Abstract
LLaDA-o is an omni diffusion model that uses a Mixture of Diffusion framework to jointly handle text understanding and visual generation through a shared attention backbone, achieving state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
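To make the MoD idea concrete, below is a minimal PyTorch-style sketch of what a Mixture-of-Diffusion training step could look like: discrete masked diffusion on text tokens and continuous diffusion on image latents sharing one attention backbone. This is not the authors' implementation; all module names, dimensions, the noise schedule, and the 1:1 loss weighting are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code) of a Mixture-of-Diffusion step:
# masked discrete diffusion on text + continuous diffusion on image latents,
# both routed through a single shared transformer backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, DIM, LATENT_DIM = 32000, 31999, 512, 64  # assumed sizes

class SharedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.text_embed = nn.Embedding(VOCAB, DIM)
        self.img_in = nn.Linear(LATENT_DIM, DIM)
        self.text_head = nn.Linear(DIM, VOCAB)      # predicts masked text tokens
        self.img_head = nn.Linear(DIM, LATENT_DIM)  # predicts noise on image latents

    def forward(self, text_ids, img_latents):
        # Concatenate both modalities into one sequence for the shared attention.
        h = torch.cat([self.text_embed(text_ids), self.img_in(img_latents)], dim=1)
        h = self.encoder(h)
        n_text = text_ids.size(1)
        return self.text_head(h[:, :n_text]), self.img_head(h[:, n_text:])

def mod_training_step(model, text_ids, img_latents):
    # Discrete branch: mask a random fraction of text tokens (masked diffusion).
    mask_ratio = torch.rand(text_ids.size(0), 1, device=text_ids.device)
    masked = torch.rand_like(text_ids, dtype=torch.float) < mask_ratio
    noisy_text = torch.where(masked, torch.full_like(text_ids, MASK_ID), text_ids)

    # Continuous branch: interpolate image latents toward Gaussian noise at time t.
    t = torch.rand(img_latents.size(0), 1, 1, device=img_latents.device)
    noise = torch.randn_like(img_latents)
    noisy_img = (1 - t) * img_latents + t * noise

    text_logits, img_pred = model(noisy_text, noisy_img)

    # Cross-entropy on the masked text positions, noise regression on latents.
    text_loss = (F.cross_entropy(text_logits[masked], text_ids[masked])
                 if masked.any() else text_logits.sum() * 0)
    img_loss = F.mse_loss(img_pred, noise)
    return text_loss + img_loss  # illustrative equal weighting

# Toy usage with random data.
model = SharedBackbone()
text = torch.randint(0, VOCAB - 1, (2, 16))
latents = torch.randn(2, 32, LATENT_DIM)
loss = mod_training_step(model, text, latents)
loss.backward()
```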
Community
We present LLaDA-o, a length-adaptive Mixture-of-Diffusion (MoD) omni-diffusion model that unifies multimodal understanding and image generation, reaching 87.04 on DPG-Bench; we will open-source the model and code at https://github.com/ML-GSAI/LLaDA-o.