Simple Diffusion XS (training in progress)
At AiArtLab, we strive to create a free, compact and fast model that can be trained on consumer graphics cards.
- UNet: 1.6B parameters
- Qwen3.5: 1.8B parameters
- VAE: 32-channel, asymmetric 8x/16x
- Speed: ~30 it/s (40 sampling steps in ~1.3 s)
- Resolution: 768px to 1404px, in steps of 64px
- Limitations: trained on a small dataset (~1–2 million images), focused on illustrations
Key points
- Dec 24: Started research on Linear Transformers.
- Feb 25: Started research on UNet-based diffusion models.
- Aug 25: Started research on different VAEs.
- Sep 25: Created a simple VAE and a VAE collection.
- Dec 25: Trained SDXS-1B (0.8B at this moment), featuring an SD1.5-like UNet, LongCLIP, a 16-channel simple VAE, and a flow-matching target.
- Jan 25: Implemented a dual text encoder (SDXL-like style). Total rework.
- Feb 25: Reverted to the classic architecture; tested all SDXL innovations and went back to simple diffusion. Total rework.
- Mar 25: Created a 32-channel 8x/16x asymmetric VAE and switched to Qwen3.5-2B as the text encoder.
Samples with seed 0
Text-to-image
```python
import torch
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe_id = "AiArtLab/sdxs-1b"
pipe = DiffusionPipeline.from_pretrained(
    pipe_id,
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device)

prompt = "girl, smiling, red eyes, blue hair, white shirt"
negative_prompt = "low quality"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
).images[0]
image.show()
```
Image upscale

```python
upscaled = pipe.image_upscale("media/girl.jpg")
upscaled[0].show()
```
Prompt refine

```python
refined = pipe.refine_prompts("girl")
print(refined)
```
Encode image (experimental)

```python
emb, mask = pipe.encode_image("media/girl.jpg")

# Inspect the pooled embedding (prepended as the first sequence element)
print("Pooled vector shape:", emb[:, 0, :].shape)

image = pipe(
    prompt_embeds=emb,
    prompt_attention_mask=mask,
    negative_prompt=negative_prompt,
    guidance_scale=4,
    width=1088,
    height=1344,
    seed=0,
    batch_size=1,
)[0]
image[0].show()
```
VAE
The VAE in Simple Diffusion utilizes an asymmetric architecture featuring an 8x encoder and a 16x decoder. While a compression factor of 8 is maintained during training, the resolution is effectively doubled during inference through an additional upscaling block. This strategy reduces training costs by an order of magnitude and boosts inference speed without perceptual quality loss. Effectively, it acts as an integrated latent upscaler. To ensure a fair comparison with other VAEs, we downsampled the generated images to match the input resolution for metric evaluation. The SDXS VAE was not trained from scratch: it was initialized from the FLUX.2 VAE weights, then redesigned and retrained. We also trained a 16-channel VAE with FLUX.1-level quality, based on the Aura VAE.
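The resolution-doubling trick can be illustrated with a toy sketch (this is not the actual SDXS VAE; the layer counts and channel widths here are purely illustrative): an encoder that downsamples 8x paired with a decoder that upsamples 16x, so a latent produced from a 512px image decodes to 1024px.

```python
import torch
import torch.nn as nn

class AsymmetricVAE(nn.Module):
    """Toy sketch: 8x-downsampling encoder, 16x-upsampling decoder.
    Decoding a latent produced at 8x therefore doubles pixel resolution."""
    def __init__(self, latent_channels=32):
        super().__init__()
        # Encoder: three stride-2 convs -> 8x spatial downsampling
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_channels, 3, stride=2, padding=1),
        )
        # Decoder: four stride-2 upsampling stages -> 16x spatial upsampling
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

vae = AsymmetricVAE()
x = torch.randn(1, 3, 512, 512)
z = vae.encoder(x)   # (1, 32, 64, 64): 512 / 8
y = vae.decoder(z)   # (1, 3, 1024, 1024): 64 * 16, i.e. 2x the input
print(z.shape, y.shape)
```

Because the encoder and decoder scale factors differ, the same latent can be decoded to either resolution depending on which decoder is attached, which is what makes the "integrated latent upscaler" behavior possible.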
8x scale factor

| VAE | MSE | PSNR | LPIPS | Edge | KL |
|---|---|---|---|---|---|
| SDXL | 1.925e-03 | 30.00 | 0.123 | 0.181 | 32.113 |
| FLUX.1 | 4.098e-04 | 36.06 | 0.033 | 0.083 | 13.127 |
| FLUX.2 | 2.425e-04 | 38.33 | 0.023 | 0.065 | 2.160 |

16x scale factor

| VAE | MSE | PSNR | LPIPS | Edge | KL |
|---|---|---|---|---|---|
| Wan2.2-TI2V-5B (2 GB) | 7.034e-04 | 34.65 | 0.050 | 0.115 | 9.429 |
| sdxs-1b (200 MB) | 2.655e-04 | 37.83 | 0.026 | 0.066 | 2.170 |
Image upscale
One interesting feature of the asymmetric VAE is that it can be used as a standalone image and video upscaler. It was trained at resolutions of 512–768 pixels and is effective within that range. Note that this is a latent upscaler, which makes it simple and fast. It is also a "blind" upscaler: unlike model-based upscalers, it interferes with the process minimally and does not alter the essence of the image. This may appeal to you if you dislike upscalers that change the image style or invent new details based on the original (say, redrawing the model of a phone in a photo). On the other hand, you may dislike it for the same reason: it changes the original only minimally.
Unet
The UNet architecture in Simple Diffusion is a direct descendant and conceptual continuation of the ideas introduced in the first version of Stable Diffusion. Key distinctions include a relatively small, yet sufficient, number of transformer blocks that ensure an even distribution of attention. Additionally, the number of channels in the final layer has been significantly increased to improve detail rendering. Overall, however, it remains a UNet, similar to SD 1.5.
Throughout the experiments, we tested hundreds of different configurations and trained dozens of models. Notably, we initially started from the SDXL architecture, assuming it would be a stronger baseline, but ultimately abandoned all of its proposed innovations: uneven attention distribution with increased transformer-block depth in the lower layers, a reduced number of blocks in the channel pyramid, micro-conditioning, the dual text encoder, text-time embeddings, and so on. According to our experiments, all of these changes increase training time and cost while having a near-zero or negative impact on the final result. In total, the investigation of various architectures and the search for the most efficient and optimal configuration took over a year.
Unfortunately, we were unable to secure grants for model training, apart from a Google TPU grant, which we could not utilize due to insufficient preparation and time constraints. As a result, training and experiments were financed primarily from our own funds and user donations. This left a significant mark on the model's architecture: we aimed to make it as small and cheap to train as possible while maintaining our generation-quality requirements. So perhaps the limited budget even worked to our advantage.
Nevertheless, we remain hopeful for continued community support, which would allow us to further develop the model while remaining as independent as possible.
Text encoder
We tested various text encoders, including, but not limited to: CLIP, LongCLIP, SigLIP, MexmaSigLIP, Qwen3-0.6B, Qwen3-0.6B embeddings, and Qwen3-1.7B. Ultimately, we settled on Qwen3.5-2B, which demonstrated unprecedented improvements in both quality and training speed. We'd also like to highlight LongCLIP: its training speed is comparable to Qwen3.5, which is remarkable for its size.
Embeddings are extracted from the second-to-last (-2) layer, with a pooled vector taken from the last token and prepended as the first sequence element. This was done to improve both composition and versatility; for example, it allows substituting a pooled vector from an image instead of a textual instruction. Training was conducted with a maximum of 250 tokens, and a 10% dropout rate was applied during training.
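As a rough sketch of this layout (toy shapes, not the real Qwen3.5 hidden size; variable names are our own), the pooled last-token vector is prepended to the -2-layer hidden states:

```python
import numpy as np

# Toy stand-ins: batch of 1, sequence of 5 tokens, hidden size 8.
# A real LM returns one hidden-state tensor per layer; we fake 4 layers here.
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((1, 5, 8)) for _ in range(4)]
attention_mask = np.array([[1, 1, 1, 0, 0]])  # 3 real tokens, 2 padding

h = hidden_states[-2]                        # second-to-last layer
last_idx = attention_mask.sum(axis=1) - 1    # index of last non-padded token
pooled = h[np.arange(h.shape[0]), last_idx]  # (batch, hidden)

# Prepend the pooled vector as the first "token" of the embedding sequence,
# matching the emb[:, 0, :] convention used in the encode_image example.
emb = np.concatenate([pooled[:, None, :], h], axis=1)  # (batch, seq_len + 1, hidden)
print(emb.shape)
```

This mirrors why `emb[:, 0, :]` in the experimental `encode_image` snippet is the pooled vector: position 0 of the sequence is reserved for it.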
Additionally, the use of a full-fledged language model allowed us to integrate an optional prompt enhancement mechanism into the pipeline.
Retrospective and Key Takeaways
The Journey Begins
This adventure started in December 2024 after the release of the SANA model. We received a donation from Stan for fine-tuning SANA and, together with Stas, began fine-tuning and further developing it. Despite spending the entire budget, we did not achieve significant improvements. However, we were shocked by how poorly the model was trained and designed, and we became convinced that we could do better, though we were wrong.

Shifting Gears

By February 2025, we split our efforts and began designing our own architectures, which we are still doing today. Stas favored the DiT architecture, while I believed in UNet. Despite some differences in architectural views, we maintained close communication, shared our work, and supported each other throughout the process. We also engaged with the AiArtLab community (a virtual Telegram chat for those contributing to model development). Thank you all for your support.
Main mistake
One of my key mistakes was relying too heavily on LLMs and research papers. Research often presents minor improvements as groundbreaking innovations, and LLMs, trained on such content, can draw incorrect conclusions. From autumn 2025, I radically changed my strategy, switching to training simpler models (VAEs), where simple fine-tuning yielded more substantial improvements than expensive research projects—including fine-tuning a VAE to a quality level comparable to Flux-1 at the time. This shift led me to adopt a zero-trust policy toward any external information not personally verified. This does not mean that you should not read papers, but I urge you not to trust the conclusions presented in them. This is an extremely radical approach, and I have intentionally radicalized it, but it allowed me to transition from reading papers and implementing other people's ideas to generating my own and training models.
As a result, I focused on building a strong local benchmark for rapid, cost-effective experiments on a single RTX 4080. This led me to train models on the "Butterflies" dataset, a set of 1,000 images of butterflies, where a model can be trained from scratch in about an hour to assess the impact of a hypothesis or improvement.
The Evolutionary Path
The second turning point was the transition to a continuous evolutionary improvement strategy. Unfortunately, the Butterflies dataset does not allow for evaluating prompt-following or anatomical generation capabilities. As a result, the model evolved incrementally rather than through revolutionary changes. The same model, from December 2025, underwent around 10 changes, including radical architectural shifts—while always preserving the pre-trained weights. It’s remarkable how well and quickly pre-trained models adapt to changes in architecture and external factors, even radical ones (e.g., switching VAE models, text encoders, or their combinations). In addition to saving on training costs, this approach helped maintain minimal model size—for example, adding extra transformer blocks followed by an assessment of necessity and rolling back if the changes had no significant impact.
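One common way to add blocks while preserving pretrained weights is to zero-initialize the new block's output projection, so the modified network starts out computing exactly what it did before the surgery. This is a minimal sketch of the general technique, not the actual SDXS training code:

```python
import torch
import torch.nn as nn

class ZeroInitBlock(nn.Module):
    """A new residual block whose output projection starts at zero,
    so inserting it into a pretrained network is initially a no-op."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.out = nn.Linear(dim, dim)
        # Zero-init the residual branch: block(x) == x at step 0.
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, x):
        return x + self.out(self.body(x))

x = torch.randn(2, 64)
block = ZeroInitBlock(64)
identical = torch.allclose(block(x), x)  # True: pretrained behavior preserved
print(identical)
```

Because the inserted block is initially an identity, training can then decide whether the extra capacity helps; if it does not, the block can be rolled back without having disturbed the pretrained weights.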
TL;DR: Main idea
Stop reading, start training
The Role of Hyperparameters
One of the initial mistakes was an excessive focus on hyperparameters during training. Ironically, roughly 80% of training speed and quality depends on the model architecture (UNet) and the quality of the latents (VAE), and most of the remaining 20% on the text encoder's embeddings; hyperparameters account for what little is left. The irony is that Adam (adamw8bit) is surprisingly forgiving of hyperparameter errors, so I won't even list them. The defaults are fine.
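For illustration, here is `torch.optim.AdamW` with completely untouched defaults on a toy objective (bitsandbytes' `AdamW8bit`, used for the actual training, is intended as a near drop-in replacement); even without any tuning, the parameter moves steadily toward the minimum:

```python
import torch

# Toy quadratic "model": loss = p^2, minimized at p = 0.
param = torch.nn.Parameter(torch.tensor([5.0]))
opt = torch.optim.AdamW([param])  # all defaults: lr=1e-3, betas=(0.9, 0.999), ...

for _ in range(200):
    opt.zero_grad()
    loss = (param ** 2).sum()
    loss.backward()
    opt.step()

print(param.item())  # smaller than the starting value of 5.0
```

The point is not the toy problem but the claim it illustrates: with Adam-family optimizers, defaults are usually good enough that architecture and data quality dominate the outcome.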
Tools and Optimization
The model comes with two scripts:

- A dataset script that converts a folder of image-text pairs into latent representations.
- A training script, provided as a single monolithic file.

Additionally, there's a script that can be pasted directly into the terminal to automatically train the model with optimized parameters.
Training Optimization
All training was done using the AdamW8bit optimizer, which significantly reduced training costs.
Train:

```bash
apt update
apt install git-lfs
git config --global credential.helper store
git clone https://huggingface.co/AiArtLab/sdxs-1b
cd sdxs-1b
pip install -r requirements.txt -U
mkdir datasets
cd datasets
hf download babkasotona/ds1234_noanime_704_vae8x16x --repo-type dataset --local-dir ds1234_noanime_704_vae8x16x
cd ..
nohup accelerate launch train.py &
```
Model Limitations:
- Limited concept coverage due to the small dataset (~1 million images).
Acknowledgments
- Stan — Key investor. Thank you for believing in us when others called it madness.
- the last neural cell - Thank you for providing 8xH100 for 48 hours
- Captainsaturnus
- Love. Death. Transformers.
- TOPAPEC
Datasets
Donations
- Rubles: For users from Russia
- DOGE: DEw2DR8C7BnF8GgcrfTzUjSnGkuMeJhg83
- BTC: 3JHv9Hb8kEW8zMAccdgCdZGfrHeMhH1rpN
- Crypto: https://nowpayments.io/donation/sdxs
Contacts
Please contact us if you can provide GPUs or funding for training.
- Telegram: recoilme (preferred)
- Mail: mail at aiartlab.org (slow response)
Citation
```bibtex
@misc{sdxs,
  title={Simple Diffusion XS},
  author={recoilme and muinez and AiArtLab Team},
  url={https://huggingface.co/AiArtLab/sdxs-1b},
  year={2026}
}
```