Hi everyone,
I have fine-tuned many large language models (LLMs) using the Alpaca dataset, and now I want to evaluate these fine-tuned models and compare their performance.
I’m looking for advice on the following:
- Evaluation methods or platforms suitable for assessing fine-tuned LLMs.
- Datasets to use for evaluation.
- Fair comparison of fine-tuning effects. I want the comparison to highlight the effect of fine-tuning on each model while minimizing the influence of differences between the base models themselves. That is, the evaluation should reflect how much fine-tuning improves each model, rather than the inherent strength of some base models.
I’d greatly appreciate any suggestions on:
- Evaluation strategies that isolate the fine-tuning effect from base model differences (e.g. comparing each fine-tuned model against its own base, as in the rough sketch below).
- Metrics, datasets, or platforms suitable for a fair comparison across models.
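For concreteness, the kind of comparison I have in mind looks roughly like the sketch below. The model names and scores are invented, just to illustrate the per-pair delta idea:

```python
# Toy sketch: compare each fine-tuned model against its *own* base model on the
# same benchmark, so the reported number reflects the gain from fine-tuning
# rather than the absolute strength of the base model. All numbers are made up.
scores = {
    # family: (base-model accuracy, fine-tuned accuracy) on one shared benchmark
    "model_a": (0.41, 0.52),
    "model_b": (0.58, 0.61),
}

for name, (base, tuned) in scores.items():
    absolute_gain = tuned - base
    # Normalising by the remaining headroom (1 - base) avoids penalising a
    # strong base model that simply has less room left to improve.
    relative_gain = absolute_gain / (1.0 - base)
    print(f"{name}: +{absolute_gain:.3f} absolute, {relative_gain:.1%} of remaining headroom")
```

Whether the absolute delta or the headroom-normalised one is the fairer measure is exactly the kind of thing I would like advice on.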
Thank you in advance for your insights!
Hi, thank you for your suggestions. But now I have another problem:
- My server cannot access Hugging Face – how can I evaluate LLMs with local models and datasets?
I am trying to use the LightEval library to evaluate my models on benchmarks such as MMLU, TruthfulQA, ARC, and HellaSwag.
However, my server cannot directly access Hugging Face to download models or datasets.
I have tried using LightEval’s custom task methods, but I was not successful.
What I hope to achieve:
- Use a local model path
- Use a local dataset path (the datasets were downloaded from Hugging Face on another machine)
- Run the LightEval evaluation entirely offline, without requiring network access (see the sketch below)

My questions:
- Are there any code examples or templates that show how to evaluate LLMs with local models and datasets?
- Alternatively, is there another way to perform similar LLM benchmark evaluations in a fully offline environment?
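Roughly, this is the setup I am trying to get working. It is only a sketch of what I have in mind, and the paths below are placeholders rather than real paths from my server:

```python
import os

# Force the Hugging Face libraries into offline mode so nothing tries to reach the Hub.
# (In practice these would be exported in the shell before launching the evaluation.)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: directories I copied onto the server beforehand.
MODEL_PATH = "/data/models/my-finetuned-model"
DATASET_PATH = "/data/datasets/hellaswag"

# Sanity check that both the model and the dataset resolve without any network access.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
dataset = load_dataset(DATASET_PATH)  # or datasets.load_from_disk(...) if it was saved with save_to_disk()

# The part I have not figured out: pointing LightEval's built-in tasks at these
# local copies instead of letting them download their datasets from the Hub.
```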
I would greatly appreciate any advice or examples. Thank you!
If Hugging Face’s cache is available, I think it can be implemented relatively simply…
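For example, something along these lines might work. This is only a minimal sketch, assuming you have a second machine with internet access; the repo IDs and paths below are placeholders you would replace with the ones your tasks actually need:

```python
# Step 1 (on a machine WITH internet access): pre-download the model and the
# benchmark datasets into the standard Hugging Face cache.
from huggingface_hub import snapshot_download

snapshot_download("your-org/your-finetuned-model")      # placeholder model repo ID
snapshot_download("cais/mmlu", repo_type="dataset")     # example; use the dataset repos your tasks load

# Step 2: copy the cache directory (by default ~/.cache/huggingface) to the
# offline server with rsync/scp, keeping the directory layout intact.

# Step 3 (on the offline server): enable offline mode before running anything.
# In practice you would export these in the shell before launching LightEval;
# with the cache in place, the same repo IDs then resolve locally instead of
# triggering a download.
import os
os.environ["HF_HOME"] = "/path/on/server/huggingface"   # wherever the copied cache lives
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

I won't guess at the exact lighteval command-line flags, since they differ between versions; the key point is that once offline mode is set and the cache is populated, model and dataset repo IDs resolve from the local cache. If a task still tries to reach the Hub, the error message usually names the missing repo, which tells you what else to pre-download.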