Check in here for tok/s numbers and benchmarks for local GGUF models
🚀 Performance Benchmark: Qwen3-Coder-Next (GGUF Q4_K_M)
Quantization: Q4_K_M
Backend: LM Studio 0.4.1 (CUDA 12)
💻 Hardware Specifications
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5080 (16GB GDDR7) |
| Driver | NVIDIA 590 Linux Driver (Latest Branch) |
| CPU | Intel Core Ultra 9 285K |
| RAM | 64GB DDR5 @ 6800 MT/s |
| OS | Fedora 43 Workstation (latest kernel and updates) |
⚙️ Inference Settings
- Context Length: 60,000 Tokens
- Layer Offloading: 35 MoE Layers to CPU (Rest on GPU)
- KV Cache: Offloaded to GPU (Q8_0 Precision)
- CPU Threads: 8 Cores
- Features: Flash Attention ON
- Max Concurrency: 10
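Offloading the KV cache to the GPU at Q8_0 is what makes a 60K context fit alongside the weights. A rough back-of-the-envelope estimator, assuming standard grouped-query attention on every layer (the real architecture may differ, e.g. with hybrid attention layers); the dimensions below are placeholders for illustration, not this model's actual config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_element):
    # K and V each store n_layers * n_kv_heads * head_dim values per token of context.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_element

# llama.cpp's Q8_0 packs 32 values plus a 2-byte scale into 34 bytes.
Q8_0 = 34 / 32   # ~1.06 bytes per value
F16 = 2.0

# Hypothetical dimensions for illustration only (not this model's real config).
gib_q8 = kv_cache_bytes(48, 8, 128, 60_000, Q8_0) / 2**30   # ~5.8 GiB
gib_f16 = kv_cache_bytes(48, 8, 128, 60_000, F16) / 2**30   # ~11.0 GiB
```

Under those made-up dimensions, Q8_0 roughly halves the KV footprint versus F16, which is the point of the setting.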
📊 Results
Testing performed with medium-sized coding prompts.
- Single Request: 40 - 45 tok/s
- Concurrent (10 Requests):
  - Per Request: 9 - 10 tok/s
  - Total Throughput: ~70 tok/s
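For reference, "total throughput" here means aggregate tokens divided by the wall-clock window, which is why it can come in below per-request rate × concurrency when requests don't perfectly overlap. A small sketch with made-up numbers shaped like the report above:

```python
def per_request_rates(results):
    """results: list of (tokens_generated, request_seconds), one entry per request."""
    return [tok / sec for tok, sec in results]

def total_throughput(results, wall_seconds):
    """Aggregate tokens over the wall-clock window covering all requests."""
    return sum(tok for tok, _ in results) / wall_seconds

# 10 requests of 190 tokens, each taking 20 s of its own time (9.5 tok/s apiece),
# but spread over a 27 s wall-clock window because they don't fully overlap.
results = [(190, 20.0)] * 10
rates = per_request_rates(results)        # each 9.5 tok/s
total = total_throughput(results, 27.0)   # ~70.4 tok/s aggregate
```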
RTX 3090 24GB + Ryzen 9 5950X with 32GB RAM; single request, 70K-token input / 200-token output
Qwen3-Coder-Next-Q3_K_XL
"Qwen3-Coder-Next-Q3_K_XL_100K":
cmd: |
/home/user/llama.cpp/build/bin/llama-server
--model /mnt/storage/GGUFs/Qwen3-Coder-Next-UD-Q3_K_XL.gguf
--no-warmup
--ctx-size 100000
--no-context-shift
# --n-gpu-layers 25 <--- do not use, let llama.cpp do its memory fitting algorithm magic! See here: https://deepwiki.com/search/im-confused-about-the-behavior_1eeb09c8-52c8-4c05-93fa-5bb9dce86b96
--temp 1
--top-p 0.95
--top-k 40
--repeat-penalty 1
--min-p 0
--jinja
--host 0.0.0.0
--port ${PORT}
--no-mmap
--flash-attn on
Memory usage: 21.8GB VRAM + 19.9GB RAM
Prompt processing speed: 567 t/s
Generation speed: 36.2 t/s
Qwen3-Coder-Next-Q4_K_XL
"Qwen3-Coder-Next-Q4_K_XL_90K":
cmd: |
/home/user/llama.cpp/build/bin/llama-server
--model /mnt/storage/GGUFs/Qwen3-Coder-Next-UD-Q4_K_XL.gguf
--no-warmup
--ctx-size 90000
--no-context-shift
# --n-gpu-layers 25 <--- do not use, let llama.cpp do its memory fitting algorithm magic! See here: https://deepwiki.com/search/im-confused-about-the-behavior_1eeb09c8-52c8-4c05-93fa-5bb9dce86b96
--temp 1
--top-p 0.95
--top-k 40
--repeat-penalty 1
--min-p 0
--jinja
--host 0.0.0.0
--port ${PORT}
--no-mmap
--flash-attn on
Memory usage: 21.7GB VRAM + 27.4GB RAM + 4.0GB swapfile (on an NVMe SSD, 3.4GB/s read / 3.0GB/s write)
Prompt processing speed: 460 t/s
Generation speed: 28.9 t/s
What tool did you use to produce the benchmark report?
(In my case no bench tool, just some manual runs.)
I used a script generated by Codex that called the LM Studio API (it now supports concurrent requests, though still in beta).
I should try with llama.cpp to see the difference.
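The same concurrent test works against any OpenAI-compatible endpoint (LM Studio's server or llama-server). A minimal sketch of the timing logic, with the actual HTTP call left as a pluggable `send_request` function; in real use it would POST to the server's `/v1/chat/completions` endpoint and read `usage.completion_tokens` from the response (that field is the standard OpenAI response shape, which is an assumption about what your server returns):

```python
import threading
import time

def run_concurrent(send_request, n_concurrent):
    """Fire n_concurrent requests at once.

    send_request() must return the number of tokens generated for one completion.
    Returns (list of per-request tok/s, aggregate tok/s over the wall-clock window).
    """
    results, lock = [], threading.Lock()

    def worker():
        t0 = time.perf_counter()
        tokens = send_request()
        dt = time.perf_counter() - t0
        with lock:
            results.append((tokens, dt))

    wall0 = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_concurrent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall0

    rates = [tok / dt for tok, dt in results]
    return rates, sum(tok for tok, _ in results) / wall
```

Plain threads are fine here because each worker just blocks on I/O; the aggregate number divides total tokens by the single wall-clock window, matching how "total throughput" is reported above.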
Backend: llama.cpp b8086 (CUDA 13)
Model: MXFP4_MOE
System
GPU: NVIDIA RTX 5080 (16GB GDDR7)
Driver: 591.86
CPU: AMD Ryzen 7 9800X3D
RAM: 64GB DDR5 @ 6200 MT/s
OS: Windows 11
```
-c 24576 ^
--batch-size 1024 ^
--ubatch-size 256 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0
```
I tested it on a 750-line Home Assistant YAML file, with a task to optimise the code and tweak the colours for a dark theme; it did a great job, to be honest.
7,441 tokens, 3min 1s, 40.97 tokens/s
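As a quick sanity check, the reported rate is consistent with the token count and elapsed time:

```python
tokens = 7441
elapsed = 3 * 60 + 1          # 3 min 1 s, to the nearest second
reported_rate = 40.97

rate = tokens / elapsed        # ~41.1 tok/s, within rounding of the report
implied = tokens / reported_rate  # ~181.6 s implied by the reported rate
```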