🤗 LLM-Perf Leaderboard 🏋️
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 0.087 | 108.613 | 15763.828 | 1063016.711 | 27.20* |
Hover over the points for additional information.
We only show the top 90% of LLMs based on latency.
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 2.486 | 147.371 | 65311.037 | 1212590.352 | 29.56* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 2.486 | 0.398 | 65311.037 | 2661.495 | 29.56* |
| 4bit-gptq-exllama-v1-fa2 | 2.513 | 0.421 | 65311.036 | 2633.025 | 29.56* |
| 4bit-gptq-exllama-v1-eager | 2.515 | 0.424 | 65311.037 | 2662.679 | 29.56* |
| 4bit-gptq-exllama-v2-sdpa | 2.499 | 0.425 | 65311.036 | 2666.191 | 29.56* |
| 4bit-gptq-exllama-v1-sdpa | 2.48 | 0.428 | 65311.036 | 2664.976 | 29.56* |
| 4bit-bnb-fa2 | 4.467 | 3.363 | 65013.93 | 23268.535 | 29.56* |
| 4bit-bnb-eager | 4.446 | 3.835 | 65014.062 | 25487.07 | 29.56* |
| 4bit-bnb-sdpa | 4.436 | 3.848 | 65013.93 | 26017.077 | 29.56* |
| 4bit-gptq-exllama-v1-fa2 | 0.759 | 1.4 | 21326.311 | 8742.68 | 26.69* |
| 4bit-gptq-exllama-v2-fa2 | 0.758 | 1.4 | 21326.311 | 8749.008 | 26.69* |
| 4bit-gptq-exllama-v1-eager | 0.749 | 1.435 | 21326.312 | 8958.627 | 26.69* |
| 4bit-gptq-exllama-v2-eager | 0.748 | 1.437 | 21326.312 | 8980.23 | 26.69* |
| 4bit-gptq-exllama-v1-sdpa | 0.744 | 1.438 | 21326.311 | 8995.519 | 26.69* |
| 4bit-gptq-exllama-v2-sdpa | 0.742 | 1.44 | 21326.311 | 9019.264 | 26.69* |
| 8bit-bnb-eager | 0.215 | 4.64 | 35661.209 | 37808.933 | 26.69* |
| 8bit-bnb-fa2 | 0.212 | 4.679 | 35661.209 | 38609.39 | 26.69* |
| 8bit-bnb-sdpa | 0.207 | 4.803 | 35661.209 | 39412.893 | 26.69* |
| 4bit-bnb-fa2 | 1.231 | 7.65 | 21184.84 | 57873.387 | 26.69* |
| 4bit-bnb-eager | 1.221 | 7.962 | 21184.971 | 58872.682 | 26.69* |
| 4bit-bnb-sdpa | 1.216 | 8.237 | 21184.84 | 61166.327 | 26.69* |
| bfloat16-sdpa | 0.113 | 18.422 | 66512.805 | 114101.162 | 26.69 |
| 4bit-gptq-exllama-v1-fa2 | 0.806 | 1.337 | 20339.706 | 8398.815 | 22.26* |
| 4bit-gptq-exllama-v1-eager | 0.802 | 1.349 | 20339.707 | 8446.395 | 22.26* |
| 4bit-gptq-exllama-v2-fa2 | 0.797 | 1.353 | 20339.706 | 8502.111 | 22.26* |
| 4bit-gptq-exllama-v1-sdpa | 0.793 | 1.358 | 20339.706 | 8507.675 | 22.26* |
| 4bit-gptq-exllama-v2-sdpa | 0.791 | 1.359 | 20339.706 | 8497.601 | 22.26* |
| 4bit-gptq-exllama-v2-eager | 0.795 | 1.361 | 20339.707 | 8496.697 | 22.26* |
| 8bit-bnb-fa2 | 0.222 | 4.63 | 35777.361 | 40260.489 | 22.26* |
| 8bit-bnb-sdpa | 0.207 | 4.797 | 35784.527 | 39300.981 | 22.26* |
| 8bit-bnb-eager | 0.208 | 4.803 | 35784.558 | 39386.631 | 22.26* |
| 4bit-bnb-eager | 1.265 | 8.253 | 20257.332 | 60268.352 | 22.26* |
| 4bit-bnb-fa2 | 1.263 | 8.336 | 20257.201 | 61121.736 | 22.26* |
| 4bit-bnb-sdpa | 1.259 | 8.549 | 20257.201 | 61089.017 | 22.26* |
| bfloat16-eager | 0.139 | 15.957 | 69113.77 | 98579.367 | 22.26 |
| float16-eager | 0.139 | 16.108 | 69113.741 | 100686.062 | 22.26 |
| float16-sdpa | 0.135 | 17.306 | 69113.726 | 108446.829 | 22.26 |
| bfloat16-sdpa | 0.131 | 17.41 | 69113.726 | 109196.143 | 22.26 |
| bfloat16-fa2 | 0.129 | 17.872 | 69106.595 | 110149.156 | 22.26 |
| float16-fa2 | 0.131 | 17.94 | 69106.595 | 111925.191 | 22.26 |
| 4bit-gptq-exllama-v2-fa2 | 0.322 | 3.307 | 11417.443 | 20680.95 | 20.22* |
| 4bit-gptq-exllama-v1-fa2 | 0.319 | 3.343 | 11417.443 | 20946.25 | 20.22* |
| 4bit-gptq-exllama-v2-eager | 0.318 | 3.411 | 11417.444 | 21276.966 | 20.22* |
| 4bit-gptq-exllama-v1-eager | 0.317 | 3.412 | 11417.444 | 21318.34 | 20.22* |
| 4bit-gptq-exllama-v2-sdpa | 0.315 | 3.413 | 11417.443 | 21325.519 | 20.22* |
| 4bit-gptq-exllama-v1-sdpa | 0.316 | 3.415 | 11417.443 | 21284.671 | 20.22* |
| 8bit-bnb-eager | 0.133 | 7.632 | 17162.983 | 63327.692 | 20.22* |
| 8bit-bnb-fa2 | 0.131 | 7.672 | 17162.139 | 62918.239 | 20.22* |
| 8bit-bnb-sdpa | 0.127 | 7.885 | 17162.139 | 65452.264 | 20.22* |
| 4bit-bnb-eager | 0.502 | 12.995 | 11093.767 | 101028.853 | 20.22* |
| 4bit-bnb-fa2 | 0.511 | 13.067 | 11094.619 | 100869.91 | 20.22* |
| 4bit-bnb-sdpa | 0.501 | 13.606 | 11094.619 | 106419.525 | 20.22* |
| 4bit-gptq-exllama-v2-fa2 | 0.178 | 6.104 | 7110.584 | 38185.725 | 15.22* |
| 4bit-gptq-exllama-v1-fa2 | 0.176 | 6.156 | 7110.584 | 38429.543 | 15.22* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 4bit-gptq-exllama-v2-eager | 0.094 | 100.902 | 11624.262 | 1009135.618 | 27.20* |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| 8bit-bnb-eager | 0.094 | 10.872 | 4344.502 | 242054.673 | 27.20* |
| 4bit-bnb-eager | 0.519 | 18.653 | 2984.872 | 380579.469 | 27.20* |
| 4bit-awq-gemm-eager | 0.223 | 24.443 | 2589.101 | 530788.272 | 27.20* |
| float16-eager | 0.112 | 27.025 | 7924.275 | 525313.012 | 27.20 |
| bfloat16-eager | 0.703 | 27.159 | 7924.275 | 523070.356 | 27.20 |
| 4bit-gptq-exllama-v2-eager | 0.12 | 28.539 | 2674.711 | 614602.889 | 27.20* |
| 4bit-awq-gemv-eager | 0.218 | 28.565 | 2589.101 | 583495.271 | 27.20* |
| 4bit-awq-exllama-v1-eager | 1.719 | 29 | 2569.852 | 597058.673 | 27.20* |
| 4bit-gptq-exllama-v1-eager | 0.12 | 29.673 | 2674.711 | 616350.878 | 27.20* |
| 8bit-bnb-eager | 0.094 | 10.872 | 4344.502 | 242054.673 | 25.97* |
| 4bit-bnb-eager | 0.519 | 18.653 | 2984.872 | 380579.469 | 25.97* |
| 4bit-awq-gemm-eager | 0.223 | 24.443 | 2589.101 | 530788.272 | 25.97* |
| float16-eager | 0.112 | 27.025 | 7924.275 | 525313.012 | 25.97 |
| bfloat16-eager | 0.703 | 27.159 | 7924.275 | 523070.356 | 25.97 |
| 4bit-gptq-exllama-v2-eager | 0.12 | 28.539 | 2674.711 | 614602.889 | 25.97* |
| 4bit-awq-gemv-eager | 0.218 | 28.565 | 2589.101 | 583495.271 | 25.97* |
CPU benchmarks were run on a 32-vCPU AWS C7i instance. The reported memory is the maximum RAM consumption during the decode phase.
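The leaderboard's actual measurements come from Optimum-Benchmark, but the idea of tracking peak RAM during the decode phase can be sketched with the standard library alone (assumption: a Linux host, where `ru_maxrss` is reported in kilobytes):

```python
import resource


def peak_rss_mb() -> float:
    """High-water mark of this process's resident set size, in MB.

    On Linux, ru_maxrss is in kilobytes (on macOS it is in bytes);
    this sketch assumes Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Example: read the high-water mark before and after a "decode phase".
before = peak_rss_mb()
buffer = bytearray(50 * 1024 * 1024)  # stand-in for decode-time allocations
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Because `ru_maxrss` is a high-water mark, the second reading can only stay equal or grow, which is exactly the "max RAM consumption" semantics described above.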
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| bfloat16-eager-onnxruntime | 1.149 | 105.028 | 33874.035 | 10315773.503 | 27.20 |
| Experiment 🧪 | Prefill (s) | Decode (tokens/s) | Memory (MB) | Energy (tokens/kWh) | Open LLM Score (%) |
|---|---|---|---|---|---|
| float16-eager-onnxruntime | 1.149 | 9.161 | 33874.035 | 507061.519 | 27.20 |
| float32-sdpa-onnxruntime | 1.148 | 9.167 | 33883.009 | 506519.539 | 27.20 |
| float16-sdpa-onnxruntime | 1.149 | 9.176 | 34152.235 | 555187.043 | 27.20 |
| float32-eager-onnxruntime | 1.146 | 9.178 | 34165.006 | 507484.453 | 27.20 |
| bfloat16-eager-onnxruntime | 1.149 | 9.182 | 33952.588 | 506845.367 | 27.20 |
| bfloat16-sdpa-onnxruntime | 1.155 | 9.195 | 33879.2 | 506373.04 | 27.20 |
| float32-eager | 0.919 | 9.344 | 16873.779 | 512276.739 | 27.20 |
| float16-eager | 0.871 | 9.873 | 8469.733 | 539138.011 | 27.20 |
| bfloat16-eager | 1.256 | 11.956 | 8466.649 | 657268.304 | 27.20 |
| float16-eager-onnxruntime | 1.149 | 9.161 | 33874.035 | 507061.519 | 25.97 |
| float32-sdpa-onnxruntime | 1.148 | 9.167 | 33883.009 | 506519.539 | 25.97 |
| float16-sdpa-onnxruntime | 1.149 | 9.176 | 34152.235 | 555187.043 | 25.97 |
| float32-eager-onnxruntime | 1.146 | 9.178 | 34165.006 | 507484.453 | 25.97 |
| bfloat16-eager-onnxruntime | 1.149 | 9.182 | 33952.588 | 506845.367 | 25.97 |
| bfloat16-sdpa-onnxruntime | 1.155 | 9.195 | 33879.2 | 506373.04 | 25.97 |
| float32-eager | 0.919 | 9.344 | 16873.779 | 512276.739 | 25.97 |
| float16-eager | 0.871 | 9.873 | 8469.733 | 539138.011 | 25.97 |
| bfloat16-eager | 1.256 | 11.956 | 8466.649 | 657268.304 | 25.97 |
| float32-eager | 1.699 | 5.09 | 32943.063 | 277950.628 | 23.91 |
| float32-sdpa | 1.725 | 5.091 | 32936.903 | 278184.653 | 23.91 |
| float16-eager | 2.05 | 5.351 | 16698.143 | 294123.326 | 23.91 |
| float16-sdpa | 2.858 | 5.358 | 16643.273 | 294020.878 | 23.91 |
| bfloat16-eager | 2.323 | 6.679 | 16693.383 | 365665.788 | 23.91 |
| bfloat16-sdpa | 2.247 | 6.875 | 16651.837 | 377575.188 | 23.91 |
| float32-eager | 1.699 | 5.09 | 32943.063 | 277950.628 | 20.48 |
| float32-sdpa | 1.725 | 5.091 | 32936.903 | 278184.653 | 20.48 |
| float16-eager | 2.05 | 5.351 | 16698.143 | 294123.326 | 20.48 |
| float16-sdpa | 2.858 | 5.358 | 16643.273 | 294020.878 | 20.48 |
| bfloat16-eager | 2.323 | 6.679 | 16693.383 | 365665.788 | 20.48 |
| bfloat16-sdpa | 2.247 | 6.875 | 16651.837 | 377575.188 | 20.48 |
| float32-eager-pytorch | 3.5 | 2.887 | 59387.22 | 158110.391 | 20.22 |
| float32-sdpa-pytorch | 3.47 | 2.909 | 59333.816 | 159916.958 | 20.22 |
| float16-eager-pytorch | 3.846 | 3.209 | 29988.094 | 176226.264 | 20.22 |
| float16-sdpa-pytorch | 4.604 | 3.338 | 29949.034 | 177309.797 | 20.22 |
| bfloat16-eager-pytorch | 5.044 | 3.926 | 29972.107 | 216699.353 | 20.22 |
| bfloat16-sdpa-pytorch | 4.416 | 4.096 | 29976.404 | 224269.419 | 20.22 |
| float32-eager | 1.519 | 3.623 | 29801.099 | 116328.384 | 17.72 |
| float32-sdpa | 1.509 | 3.643 | 29786.874 | 117459.64 | 17.72 |
| float16-eager | 1.587 | 4.898 | 15118.303 | 158205.184 | 17.72 |
| float16-sdpa | 1.509 | 4.993 | 15168.705 | 160109.479 | 17.72 |
| bfloat16-eager | 1.93 | 5.349 | 15078.601 | 172531.761 | 17.72 |
| bfloat16-sdpa | 1.953 | 5.541 | 15032.537 | 176415.801 | 17.72 |
| float32-eager-pytorch | 2.313 | 4.411 | 37929.214 | 241400.985 | 15.28 |
| float32-sdpa-pytorch | 2.236 | 4.586 | 37925.974 | 251620.707 | 15.28 |
| float16-eager-pytorch | 2.291 | 4.746 | 19251.56 | 260714.33 | 15.28 |
| float16-sdpa-pytorch | 3.105 | 4.84 | 19250.573 | 267309.007 | 15.28 |
| bfloat16-eager-pytorch | 3.146 | 5.927 | 19252.556 | 322630.72 | 15.28 |
| bfloat16-sdpa-pytorch | 3.024 | 6.171 | 19237.421 | 338891.098 | 15.28 |
| float32-sdpa-pytorch | 1.973 | 5.151 | 33052.492 | 280862.056 | 15.22 |
| float32-eager-pytorch | 1.948 | 5.178 | 33068.073 | 280176.795 | 15.22 |
| float16-eager-pytorch | 2.196 | 5.411 | 16826.806 | 299520.759 | 15.22 |
About
The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory, and energy) of Large Language Models (LLMs) across different hardware, backends, and optimizations using Optimum-Benchmark.
Anyone from the community can request a new base model or a new hardware/backend/optimization configuration for automated benchmarking:
- Model evaluation requests should be made in the 🤗 Open LLM Leaderboard; we scrape the list of canonical base models from there.
- Hardware/backend/optimization configuration requests should be made in the 🤗 LLM-Perf Leaderboard 🏋️ or the Optimum-Benchmark repository (where the code is hosted).
Details
- To avoid communication-dependent results, only one GPU is used.
- Score is the average evaluation score obtained from the 🤗 Open LLM Leaderboard.
- LLMs are run with a batch size of 1, a prompt of 256 tokens, and 64 generated tokens, for at least 10 iterations and 10 seconds.
- Energy consumption is measured in kWh using CodeCarbon, taking into account the GPU, CPU, RAM, and the location of the machine.
- We measure three types of memory: Max Allocated Memory and Max Reserved Memory (both reported by PyTorch) and Max Used Memory (observed using PyNVML).
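The energy column in the tables above is an efficiency ratio rather than a raw consumption figure: tokens generated divided by the kWh CodeCarbon reports for the run. A minimal sketch of the arithmetic (the numbers below are made up for illustration, not taken from the leaderboard):

```python
def tokens_per_kwh(generated_tokens: int, energy_kwh: float) -> float:
    """Efficiency metric shown in the leaderboard tables: tokens per kWh."""
    return generated_tokens / energy_kwh


# Hypothetical run: 10 iterations x 64 generated tokens, with the
# energy meter reporting 5.9e-3 kWh for the whole decode phase.
tokens = 10 * 64
energy = 5.9e-3
print(f"{tokens_per_kwh(tokens, energy):,.0f} tokens/kWh")
```

Higher is better here, which is why the heavily quantized configurations can top the energy ranking despite lower absolute throughput.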
All of our benchmarks are run by a single script, benchmark_cuda_pytorch.py, using Optimum-Benchmark to guarantee reproducibility and consistency.
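The "at least 10 iterations and 10 seconds" rule from the details above can be sketched as a generic timing loop. This is a hypothetical helper, not the actual benchmark_cuda_pytorch.py code; `generate_fn` stands in for a model's generate call and is assumed to return the number of tokens it produced:

```python
import time


def benchmark_decode(generate_fn, min_iterations: int = 10,
                     min_duration_s: float = 10.0) -> float:
    """Call generate_fn until BOTH a minimum iteration count and a minimum
    wall-clock duration are reached, then return mean decode throughput
    in tokens per second."""
    latencies, tokens = [], 0
    start = time.perf_counter()
    while (len(latencies) < min_iterations
           or time.perf_counter() - start < min_duration_s):
        t0 = time.perf_counter()
        tokens += generate_fn()  # stub: returns number of generated tokens
        latencies.append(time.perf_counter() - t0)
    return tokens / sum(latencies)


# Usage with a stub standing in for model.generate (64 new tokens per call);
# min_duration_s is shortened here so the example finishes quickly.
throughput = benchmark_decode(lambda: 64, min_duration_s=0.01)
```

Requiring both thresholds keeps fast configurations from being measured on too few samples while keeping slow ones from running forever.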