Check in here for tok/s and benchmarks for local GGUF models

#1
by ykarout - opened

πŸš€ Performance Benchmark: Qwen3-Coder-Next (GGUF Q4_K_M)

Model Quantization: Q4_K_M

Backend: LM Studio 0.4.1 (CUDA 12)

πŸ’» Hardware Specifications

GPU: NVIDIA GeForce RTX 5080 (16GB GDDR7)
Driver: NVIDIA 590 Linux driver (latest branch)
CPU: Intel Core Ultra 9 285K
RAM: 64GB DDR5 @ 6800 MT/s
OS: Fedora 43 Workstation (latest kernel and updates)

βš™οΈ Inference Settings

  • Context Length: 60,000 Tokens
  • Layer Offloading: 35 MoE Layers to CPU (Rest on GPU)
  • KV Cache: Offloaded to GPU (Q8_0 Precision)
  • CPU Threads: 8 Cores
  • Features: Flash Attention ON
  • Max Concurrency: 10

πŸ“Š Results

Testing performed with medium-sized coding prompts.

  • Single Request: 40 - 45 tok/s
  • Concurrent (10 Requests):
      • Per Request: 9 - 10 tok/s
      • Total Throughput: ~70 tok/s

RTX 3090 24GB + Ryzen 9 5950X 32GB RAM - single request, input 70K tokens / output 200 tokens

Qwen3-Coder-Next-Q3_K_XL

  "Qwen3-Coder-Next-Q3_K_XL_100K":
    cmd: |
      /home/user/llama.cpp/build/bin/llama-server
      --model /mnt/storage/GGUFs/Qwen3-Coder-Next-UD-Q3_K_XL.gguf
      --no-warmup
      --ctx-size 100000
      --no-context-shift
      # --n-gpu-layers 25 <--- do not use, let llama.cpp do its memory fitting algorithm magic! See here: https://deepwiki.com/search/im-confused-about-the-behavior_1eeb09c8-52c8-4c05-93fa-5bb9dce86b96
      --temp 1
      --top-p 0.95
      --top-k 40
      --repeat-penalty 1
      --min-p 0
      --jinja
      --host 0.0.0.0
      --port ${PORT}
      --no-mmap
      --flash-attn on

Memory usage: 21.8GB VRAM + 19.9GB RAM
Prompt processing speed: 567 t/s
Generation speed: 36.2 t/s
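For anyone reproducing these numbers: llama-server reports its own speed figures in the `/completion` response, so prompt-processing and generation t/s can be read straight from the server instead of being timed by hand. A minimal sketch, assuming a recent llama.cpp build whose response includes a `timings` object with `prompt_per_second` / `predicted_per_second` fields (the host, port, and prompt below are placeholders):

```python
import json
import urllib.request


def extract_speeds(response: dict) -> tuple[float, float]:
    """Pull prompt-processing and generation speed (t/s) from a
    llama-server /completion response's `timings` object."""
    t = response["timings"]
    return t["prompt_per_second"], t["predicted_per_second"]


def run_completion(prompt: str,
                   url: str = "http://localhost:8080/completion") -> dict:
    # n_predict kept small: we only want the server-side timing stats.
    payload = json.dumps({"prompt": prompt, "n_predict": 200}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    resp = run_completion("Write a Python function that reverses a string.")
    pp, gen = extract_speeds(resp)
    print(f"prompt processing: {pp:.1f} t/s, generation: {gen:.1f} t/s")
```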

Qwen3-Coder-Next-Q4_K_XL

  "Qwen3-Coder-Next-Q4_K_XL_90K":
    cmd: |
      /home/user/llama.cpp/build/bin/llama-server
      --model /mnt/storage/GGUFs/Qwen3-Coder-Next-UD-Q4_K_XL.gguf
      --no-warmup
      --ctx-size 90000
      --no-context-shift
      # --n-gpu-layers 25 <--- do not use, let llama.cpp do its memory fitting algorithm magic! See here: https://deepwiki.com/search/im-confused-about-the-behavior_1eeb09c8-52c8-4c05-93fa-5bb9dce86b96
      --temp 1
      --top-p 0.95
      --top-k 40
      --repeat-penalty 1
      --min-p 0
      --jinja
      --host 0.0.0.0
      --port ${PORT}
      --no-mmap
      --flash-attn on

Memory usage: 21.7GB VRAM + 27.4GB RAM + 4.0GB swapfile (on an NVMe SSD, 3.4GB/s read / 3.0GB/s write)
Prompt processing speed: 460 t/s
Generation speed: 28.9 t/s

What tool did you use to produce the benchmark report?

(in my case no bench tool, just some manual runs)

I used a script generated by Codex that queried the LM Studio API (it now supports concurrent requests, though still in beta).
I should try with llama.cpp to see the difference.
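A script along those lines doesn't need to be complicated. Here is a minimal sketch of measuring per-request and aggregate tok/s against any OpenAI-compatible endpoint (LM Studio or llama-server); the URL, the placeholder model id, and reading token counts from `usage.completion_tokens` are assumptions about your setup:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default port (assumption)


def one_request(prompt: str) -> tuple[int, float]:
    """Send one chat completion; return (completion tokens, elapsed seconds)."""
    payload = json.dumps({
        "model": "qwen3-coder-next",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"], elapsed


def aggregate_throughput(results: list[tuple[int, float]],
                         wall_seconds: float) -> float:
    """Total tokens generated across all requests divided by wall-clock time."""
    return sum(tokens for tokens, _ in results) / wall_seconds


if __name__ == "__main__":
    prompts = ["Refactor this function to be more readable."] * 10  # 10 concurrent requests
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(one_request, prompts))
    wall = time.perf_counter() - start
    for tokens, secs in results:
        print(f"per-request: {tokens / secs:.1f} tok/s")
    print(f"total throughput: {aggregate_throughput(results, wall):.1f} tok/s")
```

Note that aggregate throughput is total tokens over wall-clock time, which is why it can come out lower than per-request speed times the request count.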

Backend: llama.cpp b8086 (CUDA 13)
Model: MXFP4_MOE quantization

System
GPU: NVIDIA RTX 5080 (16GB GDDR7)
Driver: 591.86
CPU: AMD Ryzen 7 9800X3D
RAM: 64GB DDR5 @ 6200 MT/s
OS: Windows 11

Relevant llama-server flags (fragment of the full command, Windows `^` line continuations):

-c 24576 ^
--batch-size 1024 ^
--ubatch-size 256 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0

I tested it on a 750-line Home Assistant YAML file, with the task of optimising the code and tweaking the colours for a dark theme. It did a great job, to be honest.

7,441 tokens, 3min 1s, 40.97 tokens/s
