Qwen3.5-9B-DFlash

Paper | GitHub | Blog

DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel, achieving up to 4.4x speedup over autoregressive decoding. This repository contains the drafter model, which must be paired with the target model Qwen/Qwen3.5-9B.
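To make the drafting/verification split concrete, here is a toy sketch of block-wise speculative verification: the drafter proposes a block of tokens, the target model scores them in a single parallel pass, and the longest matching prefix is kept, plus one corrected (or bonus) token from the target. This is an illustration of the generic speculative decoding loop under simplified greedy-matching assumptions, not the DFlash diffusion drafter itself.

```python
# Toy illustration of block-wise speculative verification (NOT the actual
# DFlash drafter, which is a block diffusion model). Tokens are plain ints.

def verify_block(draft_block, target_greedy):
    """Return the tokens committed from one draft block.

    draft_block:   tokens proposed by the drafter (length = block size)
    target_greedy: the target model's greedy token at each draft position,
                   plus one position past the block (length = block size + 1),
                   all obtained from a single parallel forward pass
    """
    accepted = []
    for i, drafted in enumerate(draft_block):
        if drafted != target_greedy[i]:
            accepted.append(target_greedy[i])  # take the target's correction
            return accepted
        accepted.append(drafted)               # draft token verified
    accepted.append(target_greedy[-1])         # full match: bonus token
    return accepted

# A mismatch at position 2 keeps two draft tokens plus the correction;
# a full match keeps the whole block plus one bonus token.
print(verify_block([5, 7, 9], [5, 7, 2, 0]))
print(verify_block([5, 7, 9], [5, 7, 9, 4]))
```

Every committed token is either verified or produced by the target model, which is why speculative decoding preserves the target model's output distribution while committing several tokens per target forward pass.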

[Figure: DFlash architecture]

Quick Start

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"

Launch Server

Use --speculative-num-draft-tokens to set the block size (8 or 16).

export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_DFLASH_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-9B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-9B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code

Tip: For long-context or agentic workloads, add --speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the drafter.
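As a rough illustration of what that flag restricts (a sketch, not SGLang internals): with a window of size W, each drafter position attends only to the previous W positions instead of the full prefix, which bounds the drafter's attention cost and KV cache on long contexts.

```python
# Illustrative sliding-window causal mask (an assumption about the general
# technique, not SGLang's implementation): position i may attend to
# positions max(0, i - window + 1) .. i only.

def sliding_window_mask(seq_len, window):
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in sliding_window_mask(5, 3):
    print(row)
```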

Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)

vLLM

Community-contributed support is available. See PRs #36847 and #36767 for details.

Benchmark Results

Setup: Single NVIDIA B200, SGLang, thinking enabled, max output length 4096. We report end-to-end throughput, including prefill time. See our GitHub repository for reproduction scripts.

Throughput and Speedup

DFlash outperforms MTP across all block sizes and concurrency levels, achieving up to 4.4x speedup at concurrency 1.

Tokens/sec (speedup vs. autoregressive baseline)

Block Size = 16

| Task | Concurrency | AR | MTP | DFlash |
|---|---|---|---|---|
| Math500 | 1 | 197 | 379 (1.9x) | 808 (4.1x) |
| Math500 | 8 | 1472 | 2569 (1.7x) | 5114 (3.5x) |
| Math500 | 16 | 2831 | 4206 (1.5x) | 7508 (2.7x) |
| Math500 | 32 | 4701 | 6028 (1.3x) | 9286 (2.0x) |
| GSM8K | 1 | 198 | 342 (1.7x) | 697 (3.5x) |
| GSM8K | 8 | 1470 | 2331 (1.6x) | 4351 (3.0x) |
| GSM8K | 16 | 2781 | 3794 (1.4x) | 6325 (2.3x) |
| GSM8K | 32 | 4581 | 5445 (1.2x) | 7559 (1.6x) |
| HumanEval | 1 | 193 | 378 (2.0x) | 840 (4.4x) |
| HumanEval | 8 | 1414 | 2461 (1.7x) | 4837 (3.4x) |
| HumanEval | 16 | 2638 | 3916 (1.5x) | 6722 (2.5x) |
| HumanEval | 32 | 4217 | 5423 (1.3x) | 8285 (2.0x) |
| MBPP | 1 | 194 | 335 (1.7x) | 755 (3.9x) |
| MBPP | 8 | 1421 | 2064 (1.5x) | 4202 (3.0x) |
| MBPP | 16 | 2667 | 3358 (1.3x) | 5843 (2.2x) |
| MBPP | 32 | 4160 | 4610 (1.1x) | 6961 (1.7x) |
| MT-Bench | 1 | 194 | 297 (1.5x) | 587 (3.0x) |
| MT-Bench | 8 | 1451 | 1945 (1.3x) | 3611 (2.5x) |
| MT-Bench | 16 | 2787 | 3115 (1.1x) | 5185 (1.9x) |
| MT-Bench | 32 | 4578 | 4453 (1.0x) | 6225 (1.4x) |
| Alpaca | 1 | 197 | 278 (1.4x) | 545 (2.8x) |
| Alpaca | 8 | 1460 | 1816 (1.2x) | 3382 (2.3x) |
| Alpaca | 16 | 2789 | 3009 (1.1x) | 5002 (1.8x) |
| Alpaca | 32 | 4574 | 4326 (1.0x) | 6247 (1.4x) |

Block Size = 8

| Task | Concurrency | AR | MTP | DFlash |
|---|---|---|---|---|
| Math500 | 1 | 195 | 452 (2.3x) | 664 (3.4x) |
| Math500 | 8 | 1458 | 3199 (2.2x) | 4703 (3.2x) |
| Math500 | 16 | 2825 | 5390 (1.9x) | 7804 (2.8x) |
| Math500 | 32 | 4712 | 7941 (1.7x) | 11003 (2.3x) |
| GSM8K | 1 | 196 | 421 (2.1x) | 591 (3.0x) |
| GSM8K | 8 | 1464 | 2954 (2.0x) | 4106 (2.8x) |
| GSM8K | 16 | 2775 | 4939 (1.8x) | 6733 (2.4x) |
| GSM8K | 32 | 4567 | 7246 (1.6x) | 9375 (2.1x) |
| HumanEval | 1 | 193 | 446 (2.3x) | 667 (3.5x) |
| HumanEval | 8 | 1411 | 3020 (2.1x) | 4366 (3.1x) |
| HumanEval | 16 | 2631 | 4884 (1.9x) | 6815 (2.6x) |
| HumanEval | 32 | 4077 | 6819 (1.7x) | 8899 (2.2x) |
| MBPP | 1 | 197 | 409 (2.1x) | 634 (3.2x) |
| MBPP | 8 | 1440 | 2710 (1.9x) | 3992 (2.8x) |
| MBPP | 16 | 2682 | 4435 (1.7x) | 6128 (2.3x) |
| MBPP | 32 | 4152 | 6213 (1.5x) | 8026 (1.9x) |
| MT-Bench | 1 | 198 | 374 (1.9x) | 525 (2.7x) |
| MT-Bench | 8 | 1478 | 2612 (1.8x) | 3668 (2.5x) |
| MT-Bench | 16 | 2836 | 4323 (1.5x) | 5905 (2.1x) |
| MT-Bench | 32 | 4617 | 6335 (1.4x) | 8288 (1.8x) |
| Alpaca | 1 | 196 | 360 (1.8x) | 503 (2.6x) |
| Alpaca | 8 | 1450 | 2497 (1.7x) | 3493 (2.4x) |
| Alpaca | 16 | 2802 | 4194 (1.5x) | 5714 (2.0x) |
| Alpaca | 32 | 4572 | 6175 (1.4x) | 8077 (1.8x) |
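The speedup factors in parentheses are simply each method's throughput divided by the AR baseline in the same row, rounded to one decimal. A minimal sketch that reproduces them from the raw tokens/sec numbers:

```python
# Recompute a speedup factor from raw tokens/sec, as shown in the tables
# above: method throughput / AR baseline, formatted to one decimal place.

def speedup(tokens_per_sec, baseline_tokens_per_sec):
    return f"{tokens_per_sec / baseline_tokens_per_sec:.1f}x"

# HumanEval, block size 16, concurrency 1 (AR baseline: 193 tok/s)
print(speedup(840, 193))  # DFlash
print(speedup(378, 193))  # MTP
```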

Acceptance Length

Format: MTP / DFlash

| Task | Block Size = 8 | Block Size = 16 |
|---|---|---|
| Math500 | 5.46 / 5.67 | 6.66 / 7.34 |
| GSM8K | 5.27 / 5.33 | 6.37 / 6.71 |
| HumanEval | 5.39 / 5.87 | 6.61 / 7.93 |
| MBPP | 4.78 / 5.31 | 5.49 / 6.62 |
| MT-Bench | 4.52 / 4.53 | 5.30 / 5.49 |
| Alpaca | 4.38 / 4.35 | 5.03 / 5.10 |
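Acceptance length is the average number of tokens committed per target-model verification step, so it caps the achievable decode speedup. A back-of-the-envelope model (our assumption, not from the paper): if drafting plus verification cost `overhead` extra target-step equivalents per step, the speedup is roughly `tau / (1 + overhead)`.

```python
# Simplified speedup model (an assumption for intuition, not the paper's
# analysis): tau tokens committed per verification step, divided by the
# per-step cost in target-model-step equivalents.

def modeled_speedup(tau, overhead=0.0):
    return tau / (1.0 + overhead)

# HumanEval at block size 16: acceptance length 7.93 caps the speedup near
# 7.9x; the measured 4.4x is consistent with roughly 0.8 steps of combined
# drafting/verification overhead under this toy model.
print(round(modeled_speedup(7.93), 1))       # zero-overhead upper bound
print(round(modeled_speedup(7.93, 0.8), 1))  # with hypothetical overhead
```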

Acknowledgements

Special thanks to David Wang for his outstanding engineering support on this project. We are also grateful to Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}