[Feature Request] Add a Model Size Selector and Release the Exact Commands Used for Evaluation

#62
by guanqun-yang - opened

As mentioned in the title, is it possible to add two features:

  • Add a model size selector so people can choose the best model for their hardware budget.
  • Release the exact commands used for evaluation. Right now people have to go through the README.md of EleutherAI's repository, which is somewhat complicated, and the exact task identifiers and metrics you report are not entirely clear (for example, what is "MMLU", and which specific metric did you use for each task, acc or acc_norm?).

If it's of any help, MMLU is hendrycksTest-{sub} in lm-evaluation-harness, where sub is a subtopic such as abstract_algebra. There are 57 such tasks, listed in lm-evaluation-harness/lm_eval/tasks/hendrycks_test.py. You have to run all of them and average acc_norm across tasks to get the numbers reported on the Open LLM Leaderboard.
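For reference, here is a minimal sketch of that averaging step. It assumes you ran the harness's main.py with --output_path results.json (the filename is arbitrary) and that the output follows the harness's usual layout of a "results" dict mapping task names to metric dicts; adjust if your version of the harness differs:

```python
# Minimal sketch: average acc_norm over the hendrycksTest-* subtasks in a
# results file written by lm-evaluation-harness, e.g. after running something like
#   python main.py --model hf-causal --model_args pretrained=<model> \
#     --tasks hendrycksTest-abstract_algebra,... --output_path results.json
# Assumption: results.json has the shape {"results": {task_name: {"acc": ..., "acc_norm": ...}}}.
import json

with open("results.json") as f:  # hypothetical path from --output_path
    results = json.load(f)["results"]

# Collect acc_norm for every MMLU subtask (task names start with "hendrycksTest-").
mmlu_scores = [
    metrics["acc_norm"]
    for task, metrics in results.items()
    if task.startswith("hendrycksTest-")
]

# Unweighted mean across subtasks, as described above (should cover all 57).
print(f"MMLU ({len(mmlu_scores)} subtasks): {sum(mmlu_scores) / len(mmlu_scores):.4f}")
```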

Open LLM Leaderboard org

Hi!
The MMLU used is the one in the harness, and @itanh0b is completely correct in what they say about how to run it.
We also added a model parameter count in the view!

clefourrier changed discussion status to closed
