Tool: Adding evaluation results to model cards
Hi everyone,
I have created a tool that incorporates the evaluation results (as a PR) into the model card, displayed as follows:
Evaluation Results (Open LLM Leaderboard)
Metric | Value |
---|---|
Avg. | {results['Average ⬆️']} |
ARC (25-shot) | {results['ARC']} |
HellaSwag (10-shot) | {results['HellaSwag']} |
MMLU (5-shot) | {results['MMLU']} |
TruthfulQA (0-shot) | {results['TruthfulQA']} |
Winogrande (5-shot) | {results['Winogrande']} |
GSM8K (5-shot) | {results['GSM8K']} |
DROP (3-shot) | {results['DROP']} |
I believe this is suitable for presenting evaluation results in the model card. I am prepared to submit pull requests to every model included in this leaderboard and automate this tool if you find it beneficial. @clefourrier @SaylorTwift , could you please share your thoughts on this? Thanks.
Example pull request:
https://huggingface.co./PulsarAI/Dolphin2.1-OpenOrca-7B/discussions/2/files
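The table template above could be filled in with a few lines of Python. This is only a minimal sketch, not the tool's actual code: `render_results_table` and `BENCHMARKS` are hypothetical names, the heading level is an assumption, and the sample scores are made up.

```python
# (table label, key in the results dict), taken from the template above.
BENCHMARKS = [
    ("ARC (25-shot)", "ARC"),
    ("HellaSwag (10-shot)", "HellaSwag"),
    ("MMLU (5-shot)", "MMLU"),
    ("TruthfulQA (0-shot)", "TruthfulQA"),
    ("Winogrande (5-shot)", "Winogrande"),
    ("GSM8K (5-shot)", "GSM8K"),
    ("DROP (3-shot)", "DROP"),
]


def render_results_table(results: dict) -> str:
    """Fill the Markdown table template with one model's leaderboard row."""
    lines = [
        "# Evaluation Results (Open LLM Leaderboard)",  # heading level is an assumption
        "",
        "| Metric | Value |",
        "|---|---|",
        f"| Avg. | {results['Average ⬆️']} |",
    ]
    lines += [f"| {label} | {results[key]} |" for label, key in BENCHMARKS]
    return "\n".join(lines)


# Made-up scores, for illustration only.
sample = {
    "Average ⬆️": 50.0, "ARC": 60.1, "HellaSwag": 80.2, "MMLU": 55.3,
    "TruthfulQA": 45.4, "Winogrande": 75.5, "GSM8K": 20.6, "DROP": 12.9,
}
print(render_results_table(sample))
```

The resulting string can be appended to the existing README text before opening the PR.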
Hi @Weyaxi!
I really like this idea, it's very cool! Thank you for suggesting this!
We have something a bit similar on our todo, but it's in the batch for beginning of next year, and if you create this tool it will give us a headstart then - so if you have the bandwidth to work on it we'll integrate it :)
Suggestion: you could also add the path/link to the details dataset for the corresponding model in your PR (and, small nit, make sure the title says "Evaluation" ^^)
Hi @Weyaxi !
I really like this idea, it's very cool! Thank you for suggesting this!
We have something a bit similar on our todo, but it's in the batch for beginning of next year, and if you create this tool it will give us a headstart then - so if you have the bandwidth to work on it we'll integrate it :)
Hi @clefourrier , thanks for your comments. Should I begin by opening the pull requests and writing the code to automate this process every x hours?
Suggestion: you could also add the path/link to the details dataset for the corresponding model in your PR (and, small nit, make sure the title says "Evaluation" ^^)
Thanks for the suggestion. Will do.
Regarding the typo issue, I am quite shocked and embarrassed that I didn't notice it earlier. I copy-pasted this format into nearly 40 repositories 😅. I will fix that as well.
About the details dataset, how can I display it? Is this approach acceptable?
Evaluation Results (📑 Detailed Results) (Open LLM Leaderboard)
Hi @Weyaxi ! Thank you for this, as @clefourrier said it was in our backlog, so that's a really neat add :)
Should I begin by opening the pull requests and writing the code to automate this process every x hours?
Writing the code would be great ! We will help you integrate it as soon as we have some bandwidth.
As for the display, this looks fine to me !
The best would be to have:
Open LLM Leaderboard Evaluation Results (details)
But I don't know how to make a header clickable (with correct font) in markdown.
Hi @SaylorTwift , sorry for the late reply. I can adapt the format as requested, but I'm not completely sure about your instructions. Should I proceed with opening pull requests? Also, I have questions about the details dataset. Can you clarify the "public" thing in the result dataset? Which link should I provide to the user? There are approximately 300 models that contain this "public" thing. For example, for the Stellarbright model, there are two results datasets:
I think @SaylorTwift meant that if you have a code base, we can probably help you put it in a space (or even in the leaderboard) to open PRs automatically on the new models in case you were doing it manually.
Re the `public` key word, I think the logging system changed recently, and we went from having `model_name` to `model_name_public` in the datasets names. I'll fix it, so you can assume there is no `public` in the names :)
I think @SaylorTwift meant that if you have a code base, we can probably help you put it in a space (or even in the leaderboard) to open PRs automatically on the new models in case you were doing it manually.
The code is ready to go; there's just a readme issue. I plan to create space for manual additions today (as a draft). After that, I don't believe connecting it to a dataset and automating this process will take long.
Re the `public` key word, I think the logging system changed recently, and we went from having `model_name` to `model_name_public` in the datasets names. I'll fix it, so you can assume there is no `public` in the names :)
Okay, assuming there is nothing called `public` in datasets :)
Problem
Well, I also discovered a weird issue: I can't include more than one link in a model card as a header. (You are able to do so in discussions.)
Example Code
# [Huggingface](https://huggingface.co.) [Github](https://github.com)
Video File
Btw, thank you for your interest, and I apologize if I am asking too many questions about this. I want to make things as perfect as possible :)
Hi!
Don't worry about asking questions, it's completely OK :)
Re- the link issue, I'd suggest something in the following spirit.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | {results['Average ⬆️']} |
ARC (25-shot) | {results['ARC']} |
HellaSwag (10-shot) | {results['HellaSwag']} |
MMLU (5-shot) | {results['MMLU']} |
TruthfulQA (0-shot) | {results['TruthfulQA']} |
Winogrande (5-shot) | {results['Winogrande']} |
GSM8K (5-shot) | {results['GSM8K']} |
DROP (3-shot) | {results['DROP']} |
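The "Detailed results can be found here" line needs a per-model link. A minimal sketch of how it might be built follows. It assumes the details dataset name is the `details_` prefix followed by the model repo id with `/` replaced by `__`, which should be checked against a real dataset; `details_url` and `render_section_header` are hypothetical helper names, and the heading level is a guess.

```python
DETAILS_PREFIX = "https://huggingface.co/datasets/open-llm-leaderboard/details_"


def details_url(repo_id: str) -> str:
    """Build the details-dataset URL for a model repo id such as 'org/model'.

    Assumption: the dataset name is the prefix plus the repo id with '/'
    replaced by '__'. Verify this before relying on it.
    """
    return DETAILS_PREFIX + repo_id.replace("/", "__")


def render_section_header(repo_id: str) -> str:
    """Return the heading plus the single 'details' link line."""
    return "\n".join([
        "# Open LLM Leaderboard Evaluation Results",  # heading level is an assumption
        f"Detailed results can be found [here]({details_url(repo_id)})",
    ])


print(render_section_header("PulsarAI/Dolphin2.1-OpenOrca-7B"))
```

Keeping the link on its own line under the heading sidesteps the problem of putting several links inside one header.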
Hi @clefourrier , I have created a Space for opening PRs manually and implemented your format.
Space app link
https://huggingface.co./spaces/Weyaxi/open-llm-leaderboard-results-pr
Example usage and result
https://huggingface.co./PulsarAI/Dolphin2.1-OpenOrca-7B/discussions/13
My planning
I am planning to open the first PRs (≈1700) from Colab or my local machine.
After the first phase is finished, I will automate the process to open new PRs for new models using HF Spaces.
My Questions
Is the planning okay with you?
Which user/organization should open PRs?
This looks great, and this planning is perfect, thank you very much for your help with this!
I assume you are using the token to open the PRs on repos?
I think you could create an account with a token for this specifically, that you could call "leaderboard_PR_bot" for example. But if you are fine doing it from your account you can also do so :)
Hi,
Thanks for your comments.
To be clear, I do not want to open 2000 pull requests from my account 😅. I will follow your suggestion and open an account. I am starting ASAP 🚀🚀
I opened the account at:
https://huggingface.co./leaderboard-pr-bot
However, I have been rate-limited after only two PRs. 😥
Oops 😱 You've been rate limited. For safety reasons, we limit the number of write operations for new users. Please try again in 24 hours or get in touch with us at [email protected] if you need access now.
How many days should a new account be on the whitelist? Do you know the answer to this question, @clefourrier ? Thanks.
Hi @clefourrier ,
With the help of @meganariley , the rate limit issue has been resolved, and I have created approximately 1500 PRs.
Now I will automate this :)
Btw, here is the dataset of the resulting PRs:
https://huggingface.co./datasets/Weyaxi/open-llm-leaderboard-results-pr
This is absolutely amazing! Great idea to add a PR dataset to follow this :)
hi @Weyaxi thanks a ton for your work, it's really cool to see your community-centric initiative and it improves the model documentation on the Hub for the whole community which is 😍
One suggestion i have: we have a set of metadata to encode eval results directly in the model card metadata. It is not super well documented but it looks like this: https://github.com/huggingface/hub-docs/blob/5e1389e1782a676f249fcc4c2798748aa0a0b4a5/modelcard.md?plain=1#L22-L44
(the metadata spec is originally from Papers with code)
And the metadata, if present, "feeds" this section on the right column in model cards:
What do you think of supporting this metadata format, in addition to the human-readable table you add in your PRs?
(cc'ing @osanseviero and @Wauplin too)
(from a brainstorming with @clefourrier : in addition to the existing metadata we could add a property to link to the "originating leaderboard" (e.g. a URL))
To complete @julien-c's comment, it is possible to use the `huggingface_hub` library to add eval results to a model card's metadata. Here is some documentation: https://huggingface.co./docs/huggingface_hub/guides/model-cards#include-evaluation-results (and package reference).
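For reference, a minimal sketch of what this might look like with `huggingface_hub`, using its `EvalResult` and `ModelCardData` classes. The model name, dataset and metric identifiers, and the score are placeholders chosen for illustration, not values from the leaderboard.

```python
from huggingface_hub import EvalResult, ModelCardData

card_data = ModelCardData(
    model_name="my-org/my-model",  # hypothetical model name
    eval_results=[
        EvalResult(
            task_type="text-generation",
            dataset_type="ai2_arc",        # assumed dataset identifier
            dataset_name="ARC (25-shot)",
            metric_type="acc_norm",        # assumed metric identifier
            metric_value=61.5,             # made-up score
        ),
    ],
)

# The eval results are serialized under the `model-index` metadata key.
metadata = card_data.to_dict()
print(card_data.to_yaml())
```

The printed YAML is what would go between the `---` markers at the top of the README.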
EDIT: sorry @clefourrier I posted without refreshing the page, hence missing your post 😁
(Everybody's super enthusiastic about your tool as you can see 🤗 )
Wow, there are so many messages :)
hi @Weyaxi thanks a ton for your work, it's really cool to see your community-centric initiative and it improves the model documentation on the Hub for the whole community which is 😍
One suggestion i have: we have a set of metadata to encode eval results directly in the model card metadata. It is not super well documented but it looks like this: https://github.com/huggingface/hub-docs/blob/5e1389e1782a676f249fcc4c2798748aa0a0b4a5/modelcard.md?plain=1#L22-L44
(the metadata spec is originally from Papers with code)
And the metadata, if present, "feeds" this section on the right column in model cards:
What do you think of supporting this metadata format, in addition to the human-readable table you add in your PRs?
(cc'ing @osanseviero and @Wauplin too)
Hi @julien-c ,
Thank you for your interest. The suggestion is excellent, and I would like to implement it.
I tried the metadata feature in this model. Is this what you meant? Unfortunately, I failed to link the evaluation results to the Papers with Code thing :(
(from a brainstorming with @clefourrier : in addition to the existing metadata we could add a property to link to the "originating leaderboard" (e.g. a URL))
To be honest, I didn't understand what you are saying here. Do you mean linking to the leaderboard when the evaluation results are clicked?
And @Wauplin just shared this super cool doc link on how to include these results easily in the model card using `huggingface_hub`
Thanks, will check that.
(Everybody's super enthusiastic about your tool as you can see 🤗 )
Yeah, I am glad about that 🤗. Thanks to all the people who are interested in this :)
Good job on making the first step work, it looks very nice!
Re the brainstorming aspect: at the moment, results appear as self-reported and link to Papers with Code, but it would be cool (in the future) to point to the Open LLM Leaderboard instead. Maybe @Wauplin has ideas on how best to do this?
at the moment, results appear as self-reported and link to Papers with Code, but it would be cool (in the future) to point to the Open LLM Leaderboard instead. Maybe @Wauplin has ideas on how best to do this?
This is a change that has to be done in the backend I guess. Not sure how to best handle it. Maybe a metadata key in the readme stating that the evals come from the Open LLM Leaderboard? Any opinion @julien-c ?
But even without the link to the Leaderboard it's already very nice to have the results in the widget! Will be really nice once it gets automated :)
We also got the suggestion from people on the datasets side to only open one PR per organization or user at a time: if it gets merged, then the bot can send others, but if not the bot can probably skip the edits.
Again, feel free to tell us if you need a hand in any place :)
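One possible reading of the "one PR per organization or user at a time" suggestion, as a small sketch. The policy encoded here is an interpretation, not something agreed in the thread: wait while an owner has an open PR, skip an owner who closed one without merging, and send the rest once one was merged. `next_models_to_pr` is a hypothetical helper.

```python
from collections import defaultdict


def next_models_to_pr(models, pr_status):
    """Choose which models to open PRs on in this round.

    models: iterable of repo ids like 'owner/model'.
    pr_status: dict mapping repo id -> 'open' | 'merged' | 'closed'
               for PRs the bot has already sent.
    """
    by_owner = defaultdict(list)
    for repo_id in models:
        by_owner[repo_id.split("/")[0]].append(repo_id)

    selected = []
    for owner, repos in by_owner.items():
        statuses = {pr_status[r] for r in repos if r in pr_status}
        remaining = [r for r in repos if r not in pr_status]
        if "open" in statuses or "closed" in statuses:
            continue  # still pending, or the owner declined: send nothing
        if "merged" in statuses:
            selected.extend(remaining)      # owner accepted: send the others
        else:
            selected.extend(remaining[:1])  # first contact: a single PR
    return selected


models = ["a/m1", "a/m2", "b/m1", "b/m2", "c/m1", "c/m2", "d/m1", "d/m2", "d/m3"]
status = {"b/m1": "open", "c/m1": "closed", "d/m1": "merged"}
print(next_models_to_pr(models, status))
```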
Good job on making the first step work, it looks very nice!
So, are you saying that this representation is good? Did I understand correctly?
Re the brainstorming aspect: at the moment, results appear as self-reported and link to Papers with Code, but it would be cool (in the future) to point to the Open LLM Leaderboard instead. Maybe @Wauplin has ideas on how best to do this?
I think pointing to the leaderboard instead of Paperswithcode will be good too.
This is a change that has to be done in the backend I guess. Not sure how to best handle it. Maybe a metadata key in the readme stating that the evals come from the Open LLM Leaderboard? Any opinion @julien-c ?
I think so.
But even without the link to the Leaderboard it's already very nice to have the results in the widget! Will be really nice once it gets automated :)
I agree, thank you very much for your comments.
We also got the suggestion from people on the datasets side to only open one PR per organization or user at a time: if it gets merged, then the bot can send others, but if not the bot can probably skip the edits.
Again, feel free to tell us if you need a hand in any place :)
Nice suggestion, and I can implement it, but I have already opened so many PRs. I apologize to everyone if the bot has been bothersome. :(
My Question
I love the idea of adding evaluation results as a widget, but I've already opened 1500 PRs. It's not a problem for me to open another 1500 (or even 10k) PRs if you're okay with that. Alternatively, we can implement this widget thing after a certain day, etc. In conclusion, it's all up to you, and I'll be happy to help :)
So, are you saying that this representation is good? Did I understand correctly?
Well it looks like it's working very well :) Last point to fix is this "pointing to the leaderboard instead of Paper with code" aspect, let's wait for some feedback on that.
... I have already opened so many PRs. I apologize to everyone if the bot has been bothersome. :(
Don't worry, we did not get complaints! This suggestion comes from the experience the datasets team has running automatic bots on the hub, and is there to make the experience smoother for everyone.
Well it looks like it's working very well :)
It seems to be working well for me too :)
... Last point to fix is this "pointing to the leaderboard instead of Paper with code" aspect, let's wait for some feedback on that.
Works for me :)
Don't worry, we did not get complaints! This suggestion comes from the experience the datasets team has running automatic bots on the hub, and is there to make the experience smoother for everyone.
Okay. I am happy about that. Thanks for letting me know :)
Any update on this?
(worth mentioning that the above documentation is part of this PR: https://github.com/huggingface/hub-docs/pull/1144. Not yet merged but the format itself shouldn't change)
Hi @Wauplin and @clefourrier ,
Thanks for your interest.
I have edited the test repository. Could you please confirm whether this format is correct, and the only changes needed are the numbers (values) and the model name at the top (for different models)?
https://huggingface.co./Weyaxi/metadata-test/raw/main/README.md
@Weyaxi Related to the ongoing PR to support eval results sources on the Hub, I have opened this PR on the zephyr-7b-beta model card: https://huggingface.co./HuggingFaceH4/zephyr-7b-beta/discussions/40
I am not sure yet if the metadata is correct (feedback's welcome for that) but I think we are close to having something good. Please let me know what you think about it.
@Weyaxi Related to the ongoing PR to support eval results sources on the Hub, I have opened this PR on the zephyr-7b-beta model card: https://huggingface.co./HuggingFaceH4/zephyr-7b-beta/discussions/40
@Wauplin , in my opinion, this is quite nice. I have written a few things regarding pending questions in that PR. I think we are close to having something good too! :)
Hi @Wauplin and @clefourrier
https://huggingface.co./Weyaxi/metadata-test (Removed DROP btw)
Is this the final and correct template? The only changes will be these (for unique models), right?
- model-index.name
- Details source URL
- Numerical results for each task
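The three per-model fields listed above could be plugged into a template along these lines. This is a sketch under assumptions, not the template from the test repository: the dataset `type` slugs and the `acc` metric type are placeholders, and `build_model_index` is a hypothetical helper. Only the task names and few-shot counts come from this thread.

```python
# (task name, few-shot count) as used in this thread, with DROP removed.
TASKS = [
    ("ARC", 25), ("HellaSwag", 10), ("MMLU", 5),
    ("TruthfulQA", 0), ("Winogrande", 5), ("GSM8K", 5),
]


def build_model_index(model_name: str, source_url: str, scores: dict) -> list:
    """Build a `model-index` entry; only the three arguments vary per model."""
    results = []
    for task, shots in TASKS:
        results.append({
            "task": {"type": "text-generation", "name": "Text Generation"},
            "dataset": {
                "name": f"{task} ({shots}-shot)",
                "type": task.lower(),  # placeholder identifier (assumption)
                "args": {"num_few_shot": shots},
            },
            # Metric type is an assumption; use each task's real metric.
            "metrics": [{"type": "acc", "value": scores[task]}],
            "source": {"url": source_url, "name": "Open LLM Leaderboard"},
        })
    return [{"name": model_name, "results": results}]


# Hypothetical model, URL, and scores.
index = build_model_index(
    "my-model",
    "https://example.com/details",
    {"ARC": 60.1, "HellaSwag": 80.2, "MMLU": 55.3,
     "TruthfulQA": 45.4, "Winogrande": 75.5, "GSM8K": 20.6},
)
print(index[0]["results"][0])
```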
Hi! It looks very good! One last check on my side:
@Wauplin what are the possible fields for `task_type`? Because not all these tasks are generative, but I'm not sure there is an option to say "log prob eval" for example
@Wauplin what are the possible fields for task_type ?
It would be cool to select the `task_type` from our list of tasks here: https://huggingface.co./tasks. Do you see something that would be more appropriate there?
(tasks list can also be requested as json: https://huggingface.co./api/tasks)
Is this the final and correct template?
@Weyaxi all `source.url` fields are set to `https://huggingface.co./datasets/open-llm-leaderboard/details_`. For a real model it would have to be fixed. Other than that I think it's good!
As a note, if you start to create/update PRs automatically, make sure that you don't overwrite an existing eval result in the model index. I expect that some models have reported their own results and we want to add new ones from the Open LLM Leaderboard without removing the old ones.
Is this the final and correct template?
@Weyaxi all `source.url` fields are set to `https://huggingface.co./datasets/open-llm-leaderboard/details_`. For a real model it would have to be fixed. Other than that I think it's good!
That was a test model so I left it like that. It will be fixed for a normal model :)
As a note, if you start to create/update PRs automatically, make sure that you don't overwrite an existing eval result in the model index. I expect that some models have reported their own results and we want to add new ones from the Open LLM Leaderboard without removing the old ones.
Well, I am confused. Are you trying to say that the bot shouldn't open PRs for models that have already reported their own results (e.g., zephyr-7b-beta)? If that is the case, no problem for me. Sorry if I misunderstood that.
My question:
What should the bot do to add metrics as a widget? Should the bot open new pull requests (PRs) for every model in the leaderboard again, or just from now on? (Edit existing PRs maybe? I know it is possible, but I currently do not know how to code that.)
Interesting! Then all tasks should probably be Question Answering, except for GSM8K which is a Text Generation - but I don't think it matters that much, probably no need to change this.
don't overwrite an existing eval result in the model index
@Wauplin would two entries with different source be displayed differently?
@Weyaxi I think editing your previous PRs would be the best - maybe it would be possible to simply add a new commit to them?
Should the bot open new pull requests (PRs) for every model in the leaderboard again, or just from now on? (Edit existing PRs maybe? I know it is possible, but I currently do not know how to code that.)
I think editing existing PRs would be best yes. And maybe start with 10, see how it goes and then continue.
To push changes to a PR, you need to use the `revision` parameter in the `push_to_hub` method of the ModelCard object. You can also edit the readme locally and use `upload_file` to upload the readme to the correct revision. Revision for a PR is always `"refs/pr/<pr_num>"` (e.g. `"refs/pr/1"`, `"refs/pr/2"`, `"refs/pr/3"`, etc.). Finally, you can use `get_repo_discussions` to list existing PRs and discussions from a repo. This will help you retrieve the existing PRs on each repo.
Hope this will help!
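Put together, the steps above might look like the sketch below. The helpers take an `api` object so that they can be tried without network access; in real use it would be a `huggingface_hub.HfApi()` instance, whose `get_repo_discussions` and `upload_file` methods are the ones called here. The helper names, the commit message, and the bot author name are illustrative assumptions.

```python
def pr_revision(pr_num: int) -> str:
    """Git reference of a Hub pull request, e.g. 'refs/pr/3'."""
    return f"refs/pr/{pr_num}"


def find_open_pr(api, repo_id: str, author: str):
    """Return the number of the first open PR by `author`, or None."""
    for discussion in api.get_repo_discussions(repo_id=repo_id):
        if (
            discussion.is_pull_request
            and discussion.status == "open"
            and discussion.author == author
        ):
            return discussion.num
    return None


def update_readme_in_pr(api, repo_id: str, pr_num: int, readme_text: str):
    """Add a commit with the new README to an existing PR."""
    return api.upload_file(
        path_or_fileobj=readme_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        revision=pr_revision(pr_num),
        commit_message="Update evaluation results",  # hypothetical message
    )


# Intended usage (needs `huggingface_hub` and a write token):
#   from huggingface_hub import HfApi
#   api = HfApi()
#   num = find_open_pr(api, "some-org/some-model", "leaderboard-pr-bot")
#   if num is not None:
#       update_readme_in_pr(api, "some-org/some-model", num, new_readme)
```

Starting with a handful of repositories, as suggested, makes it easier to check that each commit lands on the PR and not on `main`.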
Well, I am confused. Are you trying to say that the bot shouldn't open PRs for models that have already reported their own results (e.g., zephyr-7b-beta)?
Yes for a model like `zephyr-7b-beta`, no need to report the eval results again since all Open LLM Results are already reported.
Now let's take the example of Phind/Phind-CodeLlama-34B-v2. The model is evaluated on the Open LLM Leaderboard so it would be good to open a PR for it. But the model card already contains a reported evaluation on the `HumanEval` benchmark. The PR should not remove this existing evaluation when adding the new ones.
It was just a "let's be careful" message :D
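A small sketch of the "add without removing" rule, operating on the `results` list of a `model-index` entry. Treating dataset type, metric types, and source name as the identity of a result is an assumption made for this example, and `merge_results` is a hypothetical helper.

```python
def merge_results(existing: list, new: list) -> list:
    """Return the existing results plus any new ones not already reported."""

    def key(result: dict) -> tuple:
        # Identity of a result for de-duplication (an assumption).
        return (
            result.get("dataset", {}).get("type"),
            tuple(m.get("type") for m in result.get("metrics", [])),
            result.get("source", {}).get("name"),
        )

    seen = {key(r) for r in existing}
    merged = list(existing)  # existing entries are kept untouched
    for result in new:
        if key(result) not in seen:
            merged.append(result)
            seen.add(key(result))
    return merged


# Illustrative entries with made-up values.
existing = [{"dataset": {"type": "openai_humaneval"},
             "metrics": [{"type": "pass@1", "value": 1.0}]}]
new = [{"dataset": {"type": "gsm8k"},
        "metrics": [{"type": "acc", "value": 2.0}],
        "source": {"name": "Open LLM Leaderboard"}}]
print(merge_results(existing, new))
```

Running the merge a second time with the same new results leaves the list unchanged, which matters if the bot revisits a repository.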
I think editing existing PRs would be best yes. And maybe start with 10, see how it goes and then continue.
To push changes to a PR, you need to use the `revision` parameter in the `push_to_hub` method of the ModelCard object. You can also edit the readme locally and use `upload_file` to upload the readme to the correct revision. Revision for a PR is always `"refs/pr/<pr_num>"` (e.g. `"refs/pr/1"`, `"refs/pr/2"`, `"refs/pr/3"`, etc.). Finally, you can use `get_repo_discussions` to list existing PRs and discussions from a repo. This will help you retrieve the existing PRs on each repo.
Hope this will help!
This will be pretty helpful! Thanks for these resources.
Yes for a model like `zephyr-7b-beta`, no need to report the eval results again since all Open LLM Results are already reported.
Now let's take the example of Phind/Phind-CodeLlama-34B-v2. The model is evaluated on the Open LLM Leaderboard so it would be good to open a PR for it. But the model card already contains a reported evaluation on the `HumanEval` benchmark. The PR should not remove this existing evaluation when adding the new ones.
It was just a "let's be careful" message :D
Okay, now I understand more clearly. Thanks for that explanation :)
Hi @Weyaxi!
Is all good on your side? Do you need a hand on something?
I am a little busy these days but will try to work on this today and tomorrow.
There is no hurry, take your time! Just wanted to check if you needed some help :)
This is super cool folks. Would it be possible to also document what sampling and temperature settings were used per model / task (assuming they vary, if they are the same, it would be great if it were documented somewhere). It will make building further based on these benchmarks so much easier.