Normalization for MMLU-Pro doesn't make sense

#947
by ekurtic - opened

Hi folks, I believe the way MMLU-Pro scores are normalized is not correct. At the moment, normalization is done under the assumption that every question has 10 choices (so the random baseline is 1/10), which is also what the MMLU-Pro paper claims. But after briefly inspecting the MMLU-Pro test set (https://huggingface.co./datasets/TIGER-Lab/MMLU-Pro/viewer/default/test), one can notice that only 83% of the questions actually have 10 choices (see the "options" histogram in the HF dataset viewer). The other 17% of questions have anywhere from 3 to 9 choices.
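For reference, here is a minimal sketch that reproduces the option-count histogram from the dataset viewer (it assumes the "options" column of the test split holds the list of answer choices, which is what the viewer shows):

```python
# Minimal sketch: count how many answer choices each MMLU-Pro test question has.
# Assumes the "options" column holds the list of choices for each question.
from collections import Counter
from datasets import load_dataset

test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
counts = Counter(len(row["options"]) for row in test_set)

total = sum(counts.values())
for n_choices in sorted(counts):
    share = counts[n_choices] / total
    print(f"{n_choices:2d} choices: {counts[n_choices]:5d} questions ({share:.1%})")
```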

Open LLM Leaderboard org

Hi @ekurtic ,

It's a very interesting question! Let us think about it and we'll get back to you.

Open LLM Leaderboard org

You're right – about 17% of MMLU-Pro questions have fewer than ten options. We chose to use a ten-option normalisation as a practical way to maintain consistency, even though it doesn't perfectly fit every case.

If you have the time, we would welcome your thoughts on how to improve our normalisation calculations. How would you approach correcting MMLU-Pro normalisation?

EDIT: [Hi @alozowski, I still think that normalizing the scores is the right approach. It's just that we should normalize per question with 1/num_choices_for_that_question rather than normalizing the global score with 1/10.]
As for the implementation, I haven't been able to find where this normalization step is implemented for the Open LLM Leaderboard. It is certainly not part of https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess

^^ I wanted to propose that we normalize per group of questions that have the same number of choices.

So we would first group all questions with N choices and normalize their average score against a random baseline of 1/N, doing this for each N \in [1, 10].
Then we can compute the average of these grouped-by-N averages and report it as the overall MMLU-Pro score. This way we make sure that questions with varying numbers of choices are normalized consistently.
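To make the proposal concrete, here is a rough sketch (the function name and inputs are just illustrative; it assumes we have a 0/1 correctness flag and the number of choices for every question):

```python
# Rough sketch of the proposed per-group normalization. Inputs are illustrative:
# correct_flags[i] is 1 if question i was answered correctly,
# num_choices[i] is the number of options question i offers.
from collections import defaultdict

def normalize_mmlu_pro(correct_flags, num_choices):
    groups = defaultdict(list)
    for flag, n in zip(correct_flags, num_choices):
        groups[n].append(flag)

    group_scores = {}
    for n, flags in groups.items():
        acc = sum(flags) / len(flags)      # raw accuracy within the N-choice group
        baseline = 1.0 / n                 # random-guess baseline for N choices
        group_scores[n] = (acc - baseline) / (1.0 - baseline)

    # Plain average of the per-group normalized scores, as proposed above.
    # (One could also weight each group by its number of questions; see the
    # discussion on weighting further down.)
    return sum(group_scores.values()) / len(group_scores)
```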

Open LLM Leaderboard org

That's a good approach, thank you! I think we can implement this in the upcoming release, as well as add other fixes to the results calculations.

Thanks for taking the time to address the issue @alozowski !

Open LLM Leaderboard org

Let me leave this discussion open so that we take it into account in the next iteration of the Leaderboard. Feel free to share any other ideas for score normalisation here if you want to.

After that, the different [#N-choice question] group scores are combined, weighted depending on how many of those questions there are.

Question: is it better to give

  1. more weight (a multiplier) to the harder questions (those with more choices), and if so, by how much, or
  2. the same flat weight to all questions?

(I'm guessing the 2nd option) ...

@CombinHorizon I think the 2nd option, as that is also what is used in other benchmarks like GPQA, BBH, etc. (a quick sketch of that aggregation is below).
@alozowski also, I wanted to mention that if the codebase for normalization is available somewhere, I could push a PR for this (in case you folks don't have spare cycles to work on it right now).
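For completeness, a tiny sketch of what I mean by the flat option (names are illustrative; each group's normalized score is weighted by the number of questions in that group, so every question counts equally):

```python
# Sketch of the "flat" aggregation: every question carries the same weight,
# so each group's normalized score is weighted by the group's size.
def aggregate_flat(group_scores, group_sizes):
    total = sum(group_sizes.values())
    return sum(group_scores[n] * group_sizes[n] for n in group_scores) / total
```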
