The original post: /r/localllama by /u/HauntingMoment on 2025-01-06 12:58:59.

Hi guys! I'm very excited to share Lighteval, the evaluation framework we use internally at Hugging Face.

Here is how to get started fast: evaluate Llama-3.1-70B-Instruct on the GSM8K benchmark and compare the results with OpenAI's o1-mini!

pip install "lighteval[vllm,litellm]"
lighteval vllm "pretrained=meta-llama/Llama-3.1-70B-Instruct,dtype=bfloat16" "lighteval|gsm8k|5|1" --use-chat-template
lighteval endpoint litellm "o1-mini" "lighteval|gsm8k|5|1" --use-chat-template
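A note on the task argument: it is four pipe-separated fields, which (as I understand the Lighteval docs, so treat this as an assumption) mean suite, task name, number of few-shot examples, and a 0/1 flag for whether few-shot examples may be truncated when the prompt gets too long. A quick bash sketch of how the string breaks down:

```shell
# Assumed task-spec layout: suite|task|num_fewshot|truncate_fewshot
spec="lighteval|gsm8k|5|1"

# Split on "|" into the four fields
IFS='|' read -r suite task fewshot truncate <<< "$spec"

echo "suite=$suite task=$task fewshot=$fewshot truncate=$truncate"
```

So swapping the "5" for a "0" would run GSM8K zero-shot instead of 5-shot.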

If you have strong opinions on evaluation and think things are still missing, don't hesitate to contribute; we'd be delighted to have your help building what it takes to get better and safer AI.