
🥳 What is Lumos?
TL;DR: Lumos is a pure vision-based generative framework that confirms the feasibility and scalability of learning visual generative priors. It can be efficiently adapted to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.
Full abstract:
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive.
We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling.
Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner.
We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models.
We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models while using only 1/10 of the text-image pairs for fine-tuning.
We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
🪄✨ Lumos Model Card

Model Structure

Lumos is built from transformer blocks for latent diffusion and can be applied to various visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.
Source code is available at https://github.com/xiaomabufei/lumos.
Model Description