🤗 LLM-Perf Leaderboard 🏋️
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 0.087 | 108.613 | 15763.828 | 1063016.711 | 27.20* |
Hover over the points for additional information.
We only show the top 90% of LLMs based on latency.
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 2.486 | 147.371 | 65311.037 | 1212590.352 | 29.56* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 2.486 | 0.398 | 65311.037 | 2661.495 | 29.56* |
| 4bit-gptq-exllama-v1-fa2 | 2.513 | 0.421 | 65311.036 | 2633.025 | 29.56* |
| 4bit-gptq-exllama-v1-eager | 2.515 | 0.424 | 65311.037 | 2662.679 | 29.56* |
| 4bit-gptq-exllama-v2-sdpa | 2.499 | 0.425 | 65311.036 | 2666.191 | 29.56* |
| 4bit-gptq-exllama-v1-sdpa | 2.48 | 0.428 | 65311.036 | 2664.976 | 29.56* |
| 4bit-bnb-fa2 | 4.467 | 3.363 | 65013.93 | 23268.535 | 29.56* |
| 4bit-bnb-eager | 4.446 | 3.835 | 65014.062 | 25487.07 | 29.56* |
| 4bit-bnb-sdpa | 4.436 | 3.848 | 65013.93 | 26017.077 | 29.56* |
| 4bit-gptq-exllama-v1-fa2 | 0.759 | 1.4 | 21326.311 | 8742.68 | 26.69* |
| 4bit-gptq-exllama-v2-fa2 | 0.758 | 1.4 | 21326.311 | 8749.008 | 26.69* |
| 4bit-gptq-exllama-v1-eager | 0.749 | 1.435 | 21326.312 | 8958.627 | 26.69* |
| 4bit-gptq-exllama-v2-eager | 0.748 | 1.437 | 21326.312 | 8980.23 | 26.69* |
| 4bit-gptq-exllama-v1-sdpa | 0.744 | 1.438 | 21326.311 | 8995.519 | 26.69* |
| 4bit-gptq-exllama-v2-sdpa | 0.742 | 1.44 | 21326.311 | 9019.264 | 26.69* |
| 8bit-bnb-eager | 0.215 | 4.64 | 35661.209 | 37808.933 | 26.69* |
| 8bit-bnb-fa2 | 0.212 | 4.679 | 35661.209 | 38609.39 | 26.69* |
| 8bit-bnb-sdpa | 0.207 | 4.803 | 35661.209 | 39412.893 | 26.69* |
| 4bit-bnb-fa2 | 1.231 | 7.65 | 21184.84 | 57873.387 | 26.69* |
| 4bit-bnb-eager | 1.221 | 7.962 | 21184.971 | 58872.682 | 26.69* |
| 4bit-bnb-sdpa | 1.216 | 8.237 | 21184.84 | 61166.327 | 26.69* |
| bfloat16-sdpa | 0.113 | 18.422 | 66512.805 | 114101.162 | 26.69 |
| 4bit-gptq-exllama-v1-fa2 | 0.806 | 1.337 | 20339.706 | 8398.815 | 22.26* |
| 4bit-gptq-exllama-v1-eager | 0.802 | 1.349 | 20339.707 | 8446.395 | 22.26* |
| 4bit-gptq-exllama-v2-fa2 | 0.797 | 1.353 | 20339.706 | 8502.111 | 22.26* |
| 4bit-gptq-exllama-v1-sdpa | 0.793 | 1.358 | 20339.706 | 8507.675 | 22.26* |
| 4bit-gptq-exllama-v2-sdpa | 0.791 | 1.359 | 20339.706 | 8497.601 | 22.26* |
| 4bit-gptq-exllama-v2-eager | 0.795 | 1.361 | 20339.707 | 8496.697 | 22.26* |
| 8bit-bnb-fa2 | 0.222 | 4.63 | 35777.361 | 40260.489 | 22.26* |
| 8bit-bnb-sdpa | 0.207 | 4.797 | 35784.527 | 39300.981 | 22.26* |
| 8bit-bnb-eager | 0.208 | 4.803 | 35784.558 | 39386.631 | 22.26* |
| 4bit-bnb-eager | 1.265 | 8.253 | 20257.332 | 60268.352 | 22.26* |
| 4bit-bnb-fa2 | 1.263 | 8.336 | 20257.201 | 61121.736 | 22.26* |
| 4bit-bnb-sdpa | 1.259 | 8.549 | 20257.201 | 61089.017 | 22.26* |
| bfloat16-eager | 0.139 | 15.957 | 69113.77 | 98579.367 | 22.26 |
| float16-eager | 0.139 | 16.108 | 69113.741 | 100686.062 | 22.26 |
| float16-sdpa | 0.135 | 17.306 | 69113.726 | 108446.829 | 22.26 |
| bfloat16-sdpa | 0.131 | 17.41 | 69113.726 | 109196.143 | 22.26 |
| bfloat16-fa2 | 0.129 | 17.872 | 69106.595 | 110149.156 | 22.26 |
| float16-fa2 | 0.131 | 17.94 | 69106.595 | 111925.191 | 22.26 |
| 4bit-gptq-exllama-v2-fa2 | 0.322 | 3.307 | 11417.443 | 20680.95 | 20.22* |
| 4bit-gptq-exllama-v1-fa2 | 0.319 | 3.343 | 11417.443 | 20946.25 | 20.22* |
| 4bit-gptq-exllama-v2-eager | 0.318 | 3.411 | 11417.444 | 21276.966 | 20.22* |
| 4bit-gptq-exllama-v1-eager | 0.317 | 3.412 | 11417.444 | 21318.34 | 20.22* |
| 4bit-gptq-exllama-v2-sdpa | 0.315 | 3.413 | 11417.443 | 21325.519 | 20.22* |
| 4bit-gptq-exllama-v1-sdpa | 0.316 | 3.415 | 11417.443 | 21284.671 | 20.22* |
| 8bit-bnb-eager | 0.133 | 7.632 | 17162.983 | 63327.692 | 20.22* |
| 8bit-bnb-fa2 | 0.131 | 7.672 | 17162.139 | 62918.239 | 20.22* |
| 8bit-bnb-sdpa | 0.127 | 7.885 | 17162.139 | 65452.264 | 20.22* |
| 4bit-bnb-eager | 0.502 | 12.995 | 11093.767 | 101028.853 | 20.22* |
| 4bit-bnb-fa2 | 0.511 | 13.067 | 11094.619 | 100869.91 | 20.22* |
| 4bit-bnb-sdpa | 0.501 | 13.606 | 11094.619 | 106419.525 | 20.22* |
| 4bit-gptq-exllama-v2-fa2 | 0.178 | 6.104 | 7110.584 | 38185.725 | 15.22* |
| 4bit-gptq-exllama-v1-fa2 | 0.176 | 6.156 | 7110.584 | 38429.543 | 15.22* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 0.094 | 100.902 | 11624.262 | 1009135.618 | 27.20* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 8bit-bnb-eager | 0.094 | 10.872 | 4344.502 | 242054.673 | 27.20* |
| 4bit-bnb-eager | 0.519 | 18.653 | 2984.872 | 380579.469 | 27.20* |
| 4bit-awq-gemm-eager | 0.223 | 24.443 | 2589.101 | 530788.272 | 27.20* |
| float16-eager | 0.112 | 27.025 | 7924.275 | 525313.012 | 27.20 |
| bfloat16-eager | 0.703 | 27.159 | 7924.275 | 523070.356 | 27.20 |
| 4bit-gptq-exllama-v2-eager | 0.12 | 28.539 | 2674.711 | 614602.889 | 27.20* |
| 4bit-awq-gemv-eager | 0.218 | 28.565 | 2589.101 | 583495.271 | 27.20* |
| 4bit-awq-exllama-v1-eager | 1.719 | 29 | 2569.852 | 597058.673 | 27.20* |
| 4bit-gptq-exllama-v1-eager | 0.12 | 29.673 | 2674.711 | 616350.878 | 27.20* |
| 8bit-bnb-eager | 0.094 | 10.872 | 4344.502 | 242054.673 | 25.97* |
| 4bit-bnb-eager | 0.519 | 18.653 | 2984.872 | 380579.469 | 25.97* |
| 4bit-awq-gemm-eager | 0.223 | 24.443 | 2589.101 | 530788.272 | 25.97* |
| float16-eager | 0.112 | 27.025 | 7924.275 | 525313.012 | 25.97 |
| bfloat16-eager | 0.703 | 27.159 | 7924.275 | 523070.356 | 25.97 |
| 4bit-gptq-exllama-v2-eager | 0.12 | 28.539 | 2674.711 | 614602.889 | 25.97* |
| 4bit-awq-gemv-eager | 0.218 | 28.565 | 2589.101 | 583495.271 | 25.97* |
CPU benchmarks were run on a 32-vCPU AWS C7i instance. The reported memory is the maximum RAM consumption during the decode phase.
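The leaderboard's actual measurements come from Optimum-Benchmark, but the idea of tracking peak RAM during the decode phase can be sketched with the standard library alone (assumption: a Linux host, where `ru_maxrss` is reported in kilobytes):

```python
import resource


def peak_rss_mb() -> float:
    """High-water mark of this process's resident set size, in MB.

    On Linux, ru_maxrss is in kilobytes (on macOS it is in bytes);
    this sketch assumes Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Example: read the high-water mark before and after a "decode phase".
before = peak_rss_mb()
buffer = bytearray(50 * 1024 * 1024)  # stand-in for decode-time allocations
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Because `ru_maxrss` is a high-water mark, the second reading can only stay equal or grow, which is exactly the "max RAM consumption" semantics described above.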
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| bfloat16-eager-onnxruntime | 1.149 | 105.028 | 33874.035 | 10315773.503 | 27.20 |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| float16-eager-onnxruntime | 1.149 | 9.161 | 33874.035 | 507061.519 | 27.20 |
| float32-sdpa-onnxruntime | 1.148 | 9.167 | 33883.009 | 506519.539 | 27.20 |
| float16-sdpa-onnxruntime | 1.149 | 9.176 | 34152.235 | 555187.043 | 27.20 |
| float32-eager-onnxruntime | 1.146 | 9.178 | 34165.006 | 507484.453 | 27.20 |
| bfloat16-eager-onnxruntime | 1.149 | 9.182 | 33952.588 | 506845.367 | 27.20 |
| bfloat16-sdpa-onnxruntime | 1.155 | 9.195 | 33879.2 | 506373.04 | 27.20 |
| float32-eager | 0.919 | 9.344 | 16873.779 | 512276.739 | 27.20 |
| float16-eager | 0.871 | 9.873 | 8469.733 | 539138.011 | 27.20 |
| bfloat16-eager | 1.256 | 11.956 | 8466.649 | 657268.304 | 27.20 |
| float16-eager-onnxruntime | 1.149 | 9.161 | 33874.035 | 507061.519 | 25.97 |
| float32-sdpa-onnxruntime | 1.148 | 9.167 | 33883.009 | 506519.539 | 25.97 |
| float16-sdpa-onnxruntime | 1.149 | 9.176 | 34152.235 | 555187.043 | 25.97 |
| float32-eager-onnxruntime | 1.146 | 9.178 | 34165.006 | 507484.453 | 25.97 |
| bfloat16-eager-onnxruntime | 1.149 | 9.182 | 33952.588 | 506845.367 | 25.97 |
| bfloat16-sdpa-onnxruntime | 1.155 | 9.195 | 33879.2 | 506373.04 | 25.97 |
| float32-eager | 0.919 | 9.344 | 16873.779 | 512276.739 | 25.97 |
| float16-eager | 0.871 | 9.873 | 8469.733 | 539138.011 | 25.97 |
| bfloat16-eager | 1.256 | 11.956 | 8466.649 | 657268.304 | 25.97 |
| float32-eager | 1.699 | 5.09 | 32943.063 | 277950.628 | 23.91 |
| float32-sdpa | 1.725 | 5.091 | 32936.903 | 278184.653 | 23.91 |
| float16-eager | 2.05 | 5.351 | 16698.143 | 294123.326 | 23.91 |
| float16-sdpa | 2.858 | 5.358 | 16643.273 | 294020.878 | 23.91 |
| bfloat16-eager | 2.323 | 6.679 | 16693.383 | 365665.788 | 23.91 |
| bfloat16-sdpa | 2.247 | 6.875 | 16651.837 | 377575.188 | 23.91 |
| float32-eager | 1.699 | 5.09 | 32943.063 | 277950.628 | 20.48 |
| float32-sdpa | 1.725 | 5.091 | 32936.903 | 278184.653 | 20.48 |
| float16-eager | 2.05 | 5.351 | 16698.143 | 294123.326 | 20.48 |
| float16-sdpa | 2.858 | 5.358 | 16643.273 | 294020.878 | 20.48 |
| bfloat16-eager | 2.323 | 6.679 | 16693.383 | 365665.788 | 20.48 |
| bfloat16-sdpa | 2.247 | 6.875 | 16651.837 | 377575.188 | 20.48 |
| float32-eager-pytorch | 3.5 | 2.887 | 59387.22 | 158110.391 | 20.22 |
| float32-sdpa-pytorch | 3.47 | 2.909 | 59333.816 | 159916.958 | 20.22 |
| float16-eager-pytorch | 3.846 | 3.209 | 29988.094 | 176226.264 | 20.22 |
| float16-sdpa-pytorch | 4.604 | 3.338 | 29949.034 | 177309.797 | 20.22 |
| bfloat16-eager-pytorch | 5.044 | 3.926 | 29972.107 | 216699.353 | 20.22 |
| bfloat16-sdpa-pytorch | 4.416 | 4.096 | 29976.404 | 224269.419 | 20.22 |
| float32-eager | 1.519 | 3.623 | 29801.099 | 116328.384 | 17.72 |
| float32-sdpa | 1.509 | 3.643 | 29786.874 | 117459.64 | 17.72 |
| float16-eager | 1.587 | 4.898 | 15118.303 | 158205.184 | 17.72 |
| float16-sdpa | 1.509 | 4.993 | 15168.705 | 160109.479 | 17.72 |
| bfloat16-eager | 1.93 | 5.349 | 15078.601 | 172531.761 | 17.72 |
| bfloat16-sdpa | 1.953 | 5.541 | 15032.537 | 176415.801 | 17.72 |
| float32-eager-pytorch | 2.313 | 4.411 | 37929.214 | 241400.985 | 15.28 |
| float32-sdpa-pytorch | 2.236 | 4.586 | 37925.974 | 251620.707 | 15.28 |
| float16-eager-pytorch | 2.291 | 4.746 | 19251.56 | 260714.33 | 15.28 |
| float16-sdpa-pytorch | 3.105 | 4.84 | 19250.573 | 267309.007 | 15.28 |
| bfloat16-eager-pytorch | 3.146 | 5.927 | 19252.556 | 322630.72 | 15.28 |
| bfloat16-sdpa-pytorch | 3.024 | 6.171 | 19237.421 | 338891.098 | 15.28 |
| float32-sdpa-pytorch | 1.973 | 5.151 | 33052.492 | 280862.056 | 15.22 |
| float32-eager-pytorch | 1.948 | 5.178 | 33068.073 | 280176.795 | 15.22 |
| float16-eager-pytorch | 2.196 | 5.411 | 16826.806 | 299520.759 | 15.22 |
About
The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory, and energy) of Large Language Models (LLMs) across different hardware, backends, and optimizations using Optimum-Benchmark.
Anyone from the community can request a new base model or a new hardware/backend/optimization configuration for automated benchmarking:
- Model evaluation requests should be made in the 🤗 Open LLM Leaderboard; we scrape the list of canonical base models from there.
- Hardware/backend/optimization configuration requests should be made in the 🤗 LLM-Perf Leaderboard 🏋️ or the Optimum-Benchmark repository (where the code is hosted).
Details
- To avoid communication-dependent results, only one GPU is used.
- Score is the average evaluation score obtained from the 🤗 Open LLM Leaderboard.
- LLMs are run with a batch size of 1, a prompt of 256 tokens, and 64 generated tokens, for at least 10 iterations and 10 seconds.
- Energy consumption is measured in kWh using CodeCarbon, taking into account the GPU, CPU, RAM, and the location of the machine.
- We measure three types of memory: Max Allocated Memory and Max Reserved Memory (both reported by PyTorch) and Max Used Memory (observed using PyNVML).
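The energy column in the tables above is an efficiency ratio rather than a raw consumption figure: tokens generated divided by the kWh CodeCarbon reports for the run. A minimal sketch of the arithmetic (the numbers below are made up for illustration, not taken from the leaderboard):

```python
def tokens_per_kwh(generated_tokens: int, energy_kwh: float) -> float:
    """Efficiency metric shown in the leaderboard tables: tokens per kWh."""
    return generated_tokens / energy_kwh


# Hypothetical run: 10 iterations x 64 generated tokens, with the
# energy meter reporting 5.9e-3 kWh for the whole decode phase.
tokens = 10 * 64
energy = 5.9e-3
print(f"{tokens_per_kwh(tokens, energy):,.0f} tokens/kWh")
```

Higher is better here, which is why the heavily quantized configurations can top the energy ranking despite lower absolute throughput.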
All of our benchmarks are run by a single script, benchmark_cuda_pytorch.py, using Optimum-Benchmark to guarantee reproducibility and consistency.
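The "at least 10 iterations and 10 seconds" rule from the details above can be sketched as a generic timing loop. This is a hypothetical helper, not the actual benchmark_cuda_pytorch.py code; `generate_fn` stands in for a model's generate call and is assumed to return the number of tokens it produced:

```python
import time


def benchmark_decode(generate_fn, min_iterations: int = 10,
                     min_duration_s: float = 10.0) -> float:
    """Call generate_fn until BOTH a minimum iteration count and a minimum
    wall-clock duration are reached, then return mean decode throughput
    in tokens per second."""
    latencies, tokens = [], 0
    start = time.perf_counter()
    while (len(latencies) < min_iterations
           or time.perf_counter() - start < min_duration_s):
        t0 = time.perf_counter()
        tokens += generate_fn()  # stub: returns number of generated tokens
        latencies.append(time.perf_counter() - t0)
    return tokens / sum(latencies)


# Usage with a stub standing in for model.generate (64 new tokens per call);
# min_duration_s is shortened here so the example finishes quickly.
throughput = benchmark_decode(lambda: 64, min_duration_s=0.01)
```

Requiring both thresholds keeps fast configurations from being measured on too few samples while keeping slow ones from running forever.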