MMLU doesn't match on lm-evaluation-harness

#2
by yixinsong - opened

I evaluated the 1.7B models with the lm-evaluation-harness framework.

I am curious what causes the performance difference between lighteval and lm-evaluation-harness.

yixinsong changed discussion status to closed

Same question.

Hugging Face TB Research org

Hi, we use a different implementation of MMLU: a cloze version instead of the multiple-choice (MC) one, where we consider the log probabilities of the entire answer sequences rather than just the single answer letters. You can find more details about this in this blog post and in appendix G.2 of this paper.
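
For intuition, here is a rough sketch of the two scoring styles using a generic transformers causal LM. This is not the actual lighteval or lm-evaluation-harness code: the model name, prompt format, and scoring loop are illustrative placeholders, and the real implementations add details such as few-shot examples and length normalization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute the 1.7B model being evaluated.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`.

    Tokenizing prompt and prompt+continuation separately is a simplification;
    real harnesses handle tokenization boundaries more carefully.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The token at position `pos` is predicted by the logits at `pos - 1`.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total


stem = "Question: What is the capital of France?\n"
choices = ["Paris", "London", "Berlin", "Madrid"]
letters = ["A", "B", "C", "D"]

# Cloze-style (lighteval): no lettered options in the prompt; each full answer
# string is scored as a continuation of the question.
cloze_prompt = stem + "Answer:"
cloze_scores = [continuation_logprob(cloze_prompt, " " + c) for c in choices]

# MC-style (lm-evaluation-harness MMLU): options are listed in the prompt and
# only the single answer letter is scored.
mc_prompt = stem + "".join(f"{l}. {c}\n" for l, c in zip(letters, choices)) + "Answer:"
mc_scores = [continuation_logprob(mc_prompt, " " + l) for l in letters]

print("cloze pick:", choices[max(range(len(choices)), key=cloze_scores.__getitem__)])
print("MC pick   :", letters[max(range(len(letters)), key=mc_scores.__getitem__)])
```

Because the cloze variant compares full answer strings while the MC variant only compares the option letters, the two can rank models quite differently, which is why the scores from the two frameworks don't match.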

To reproduce our results, you can use the guidelines here: https://huggingface.co./HuggingFaceFW/ablation-model-fineweb-edu#evaluation
