Hi everyone,
I have fine-tuned many large language models (LLMs) using the Alpaca dataset, and now I want to evaluate these fine-tuned models and compare their performance.
I’m looking for advice on the following:
- Evaluation methods or platforms suitable for assessing fine-tuned LLMs.
- Datasets to use for evaluation.
- Fair comparison of fine-tuning effects. I want the comparison to highlight the effect of fine-tuning on each model while minimizing the influence of differences between the base models themselves. That is, the evaluation should reflect how much fine-tuning improves each model, rather than the inherent strength of some base models.
I’d greatly appreciate any suggestions on:
- Evaluation strategies that isolate the fine-tuning effect from base model differences (e.g. comparing each fine-tuned model against its own base, as in the rough sketch below).
- Metrics, datasets, or platforms suitable for a fair comparison across models.
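For concreteness, the kind of comparison I have in mind looks roughly like the sketch below. The model names and scores are invented, just to illustrate the per-pair delta idea:

```python
# Toy sketch: compare each fine-tuned model against its *own* base model on the
# same benchmark, so the reported number reflects the gain from fine-tuning
# rather than the absolute strength of the base model. All numbers are made up.
scores = {
    # family: (base-model accuracy, fine-tuned accuracy) on one shared benchmark
    "model_a": (0.41, 0.52),
    "model_b": (0.58, 0.61),
}

for name, (base, tuned) in scores.items():
    absolute_gain = tuned - base
    # Normalising by the remaining headroom (1 - base) avoids penalising a
    # strong base model that simply has less room left to improve.
    relative_gain = absolute_gain / (1.0 - base)
    print(f"{name}: +{absolute_gain:.3f} absolute, {relative_gain:.1%} of remaining headroom")
```

Whether the absolute delta or the headroom-normalised one is the fairer measure is exactly the kind of thing I would like advice on.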
Thank you in advance for your insights!
Hi, thank you for your suggestions. But now I have another problem:
- My server cannot access Hugging Face – how can I evaluate LLMs with local models and datasets?
I am trying to use the LightEval library to evaluate my models on benchmarks such as MMLU, TruthfulQA, ARC, and HellaSwag.
However, my server cannot directly access Hugging Face to download models or datasets.
I have tried using LightEval’s custom task methods, but I was not successful.
What I hope to achieve:
- Use a local model path
- Use a local dataset path (the datasets were downloaded from Hugging Face on another machine)
- Run the LightEval evaluation entirely offline, without requiring network access (see the sketch below)

My questions:
- Are there any code examples or templates that show how to evaluate LLMs with local models and datasets?
- Alternatively, is there another way to perform similar LLM benchmark evaluations in a fully offline environment?
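Roughly, this is the setup I am trying to get working. It is only a sketch of what I have in mind, and the paths below are placeholders rather than real paths from my server:

```python
import os

# Force the Hugging Face libraries into offline mode so nothing tries to reach the Hub.
# (In practice these would be exported in the shell before launching the evaluation.)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: directories I copied onto the server beforehand.
MODEL_PATH = "/data/models/my-finetuned-model"
DATASET_PATH = "/data/datasets/hellaswag"

# Sanity check that both the model and the dataset resolve without any network access.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
dataset = load_dataset(DATASET_PATH)  # or datasets.load_from_disk(...) if it was saved with save_to_disk()

# The part I have not figured out: pointing LightEval's built-in tasks at these
# local copies instead of letting them download their datasets from the Hub.
```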
I would greatly appreciate any advice or examples. Thank you!
If Hugging Face’s cache is available, I think it can be implemented relatively simply…
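For example, something along these lines might work. This is only a minimal sketch, assuming you have a second machine with internet access; the repo IDs and paths below are placeholders you would replace with the ones your tasks actually need:

```python
# Step 1 (on a machine WITH internet access): pre-download the model and the
# benchmark datasets into the standard Hugging Face cache.
from huggingface_hub import snapshot_download

snapshot_download("your-org/your-finetuned-model")      # placeholder model repo ID
snapshot_download("cais/mmlu", repo_type="dataset")     # example; use the dataset repos your tasks load

# Step 2: copy the cache directory (by default ~/.cache/huggingface) to the
# offline server with rsync/scp, keeping the directory layout intact.

# Step 3 (on the offline server): enable offline mode before running anything.
# In practice you would export these in the shell before launching LightEval;
# with the cache in place, the same repo IDs then resolve locally instead of
# triggering a download.
import os
os.environ["HF_HOME"] = "/path/on/server/huggingface"   # wherever the copied cache lives
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

I won't guess at the exact lighteval command-line flags, since they differ between versions; the key point is that once offline mode is set and the cache is populated, model and dataset repo IDs resolve from the local cache. If a task still tries to reach the Hub, the error message usually names the missing repo, which tells you what else to pre-download.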