sft, base, Turbo-Shift1, turbo-shift3, base and other models

#1
by Regal - opened

Any chance of uploading the rest of the models, with and without aio?
:)

And the 0.6b LM, please.

The text encoders are also named in a confusing way: the 0.6b that is available now is the mandatory embedding model, whereas the 1.7b and 4.0b are the optional thinking LM models.

Regal changed discussion title from Turbo-Shift1, turbo-shift3, base and other models to Turbo-Shift1, turbo-shift3, base, sft and other models
Regal changed discussion title from Turbo-Shift1, turbo-shift3, base, sft and other models to sft, base, Turbo-Shift1, turbo-shift3, base and other models

And a quantised version for us paupers?

Qwen3-coder:
The short answer: directly converting AceStep (and most modern Transformer/DiT-based diffusion models) to the GGUF format for running via llama.cpp is currently impossible.
Here is a detailed technical explanation of why that is, and what alternatives you have for saving memory on your system (Ryzen 7 + 96 GB RAM).

  1. Why won't the conversion to GGUF work?
    The GGUF format and the llama.cpp engine are optimized specifically for the decoder-only Transformer architecture (Llama, Qwen, Mistral) used in text-based LLMs. They can efficiently perform quantization (4-bit, 8-bit) and inference for that particular structure.
    The AceStep V1.5 model has fundamental differences that break compatibility with the current GGUF stack:

    DiT (Diffusion Transformer) Architecture:
    In the code (modeling_acestep_v15_turbo.py) you can see the AceStepDiTLayer class. It uses AdaLN (Adaptive Layer Normalization) with scaling driven by the timestep_embedding.
    The standard attention layers in GGUF expect a static computation. In DiT, the normalization parameters are dynamically modulated at each diffusion step depending on the timestep t, and llama.cpp has no implementation of AdaLN operators or diffusion conditioning.
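To make the AdaLN point concrete, here is a minimal sketch (plain NumPy, with made-up shapes and weights — not AceStep's actual code) of how a DiT block derives a per-step scale and shift from the timestep embedding, which is the modulation a static-graph engine cannot express:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learnable affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, t_emb, W, b):
    # AdaLN: a small projection of the timestep embedding produces a
    # scale and shift that modulate the normalized activations, so the
    # effective transform changes at every diffusion step.
    scale, shift = np.split(t_emb @ W + b, 2, axis=-1)
    return layer_norm(x) * (1 + scale) + shift

rng = np.random.default_rng(0)
hidden, t_dim = 8, 4
x = rng.standard_normal((1, hidden))
W = rng.standard_normal((t_dim, 2 * hidden)) * 0.1
b = np.zeros(2 * hidden)

# The same input is transformed differently at different timesteps:
y1 = adaln(x, rng.standard_normal((1, t_dim)), W, b)
y2 = adaln(x, rng.standard_normal((1, t_dim)), W, b)
```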
    Specific components:
    The model uses Flow Matching (rather than conventional noise-prediction diffusion), which requires dedicated integration logic (ODE/SDE solvers) that an LLM inference engine does not provide.
    The presence of FSQ (Finite Scalar Quantization) modules and specialized encoders/decoders (AutoencoderOobleck, AceStepTimbreEncoder) makes the computation graph incompatible with the simple sequential token pass that GGUF/llama.cpp implements.
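For context on the Flow Matching point: sampling means integrating a learned velocity field along an ODE from noise to data, not predicting the next token. A toy Euler-solver sketch (the velocity function here is a hand-written stand-in for the DiT network, not AceStep's model):

```python
import numpy as np

def velocity(x, t, target):
    # Stand-in for the DiT network: for a straight-line (rectified)
    # flow toward `target`, the velocity field is (target - x) / (1 - t).
    return (target - x) / (1.0 - t)

def euler_sample(x0, target, steps=50):
    # Integrate the ODE dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    # with a fixed-step Euler solver -- the kind of sampling loop a
    # diffusion inference engine must provide and llama.cpp does not.
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt, target) * dt
    return x

x0 = np.zeros(4)                  # "noise" starting point
x1 = euler_sample(x0, np.ones(4)) # arrives at the target (all ones)
```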
    Lack of support in tools:

    Utilities like convert-hf-to-gguf.py from the llama.cpp repository simply don't recognize the AceStepConditionGenerationModel architecture; they will raise an error or produce a broken file because they don't know how to map AceStepDiTLayer layers to GGML/GGUF operators.

  2. What about 8-bit quantization in general?
    Although the GGUF format is not an option, quantizing the model weights is possible; it just requires other tools and approaches:
    Option A: Quantization via PyTorch (bitsandbytes / torchao)
    You can load a model into Python and quantize it on the fly, or save the quantized weights in PyTorch format (.pt or .safetensors), but not in GGUF.

    Tool: the bitsandbytes library (supports 8-bit and 4-bit) or torchao (from the PyTorch team, supports INT8 on CPU).
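As a rough illustration of what 8-bit weight quantization buys you, here is a generic absmax int8 scheme in NumPy (a simplified sketch, not the exact algorithm bitsandbytes or torchao use):

```python
import numpy as np

def quantize_int8(w):
    # Absmax quantization: map each row of floats into [-127, 127]
    # using one fp32 scale per row; store int8 weights plus scales.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate fp32 weights at inference time.
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is ~4x smaller than fp32 (ignoring the small scales),
# and the per-element error is bounded by half a quantization step.
err = np.abs(w - w_hat).max()
```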
