Eval time vs. score diagram
On the Portuguese version of the old/'v1' Open LLM Leaderboard I saw an interesting plot.
See the Metrics tab, and look at the bottom: https://huggingface.co./spaces/eduagarcia/open_pt_llm_leaderboard
There you can roughly eyeball the scaling laws, and also see that around 9B parameters models can ace these older-style tests.
Maybe add something like that, or a model size vs. score plot, instead of evaluation time.
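A minimal sketch of what such a size vs. score plot could look like. This is not the leaderboard's actual plotting code, and the model names and scores below are made-up placeholder values, just to illustrate the log-scale scatter:

```python
# Hypothetical sketch of a model-size vs. score scatter plot.
# All data points are illustrative, not real leaderboard scores.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# (parameters in billions, average benchmark score) - example values only
models = {
    "small-1.5b": (1.5, 38.0),
    "mid-7b": (7.0, 55.0),
    "mid-9b": (9.0, 61.0),
    "large-70b": (70.0, 72.0),
}

def size_vs_score_figure(models):
    fig, ax = plt.subplots()
    sizes = [size for size, _ in models.values()]
    scores = [score for _, score in models.values()]
    ax.scatter(sizes, scores)
    # Scaling laws look roughly linear in log-parameters,
    # so a log x-axis makes the trend easy to eyeball.
    ax.set_xscale("log")
    ax.set_xlabel("Parameters (billions, log scale)")
    ax.set_ylabel("Average score")
    for name, (size, score) in models.items():
        ax.annotate(name, (size, score))
    return fig, ax

fig, ax = size_vs_score_figure(models)
```

Swapping the x-axis from evaluation time to parameter count would only require changing the column fed into the scatter call.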
Hi @HenkPoley ,
This is a very good idea! We're a bit short on time at the moment, would you be interested in contributing this feature?
Some of the notable models that performed well in Portuguese are:
THUDM/glm-4-9b-chat-1m
THUDM/glm-4-9b-chat
THUDM/glm-4-9b
Unfortunately, they trigger the error message: “needs to be launched with trust_remote_code=True”.
Could this be mitigated somehow? What are the prospects?
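For context, that error comes from transformers refusing to execute a model's custom modeling code unless the caller opts in. A sketch of how a harness might pass that opt-in through; `trust_remote_code` is the real transformers parameter, but the helper function below is hypothetical:

```python
# Hypothetical helper: build the kwargs an evaluation harness would pass
# to transformers' from_pretrained(). Models like THUDM/glm-4-9b ship
# custom modeling code, which transformers only runs when the caller
# explicitly sets trust_remote_code=True.
def from_pretrained_kwargs(model_id: str, allow_remote_code: bool = False) -> dict:
    kwargs = {"pretrained_model_name_or_path": model_id}
    if allow_remote_code:
        # Opt in to executing the repository's custom Python code.
        kwargs["trust_remote_code"] = True
    return kwargs

# e.g. AutoModelForCausalLM.from_pretrained(
#          **from_pretrained_kwargs("THUDM/glm-4-9b", allow_remote_code=True))
```

Because running remote code is a security decision, leaderboards typically gate it behind a manual review rather than enabling it for every submission, which is presumably why these models need special handling.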
Hi @CombinHorizon ,
Currently we have results for THUDM/glm-4-9b and THUDM/glm-4-9b-chat that we added manually; you can find them on the Leaderboard. If you're interested, we can add THUDM/glm-4-9b-chat-1m as well.
Closing this discussion due to inactivity, feel free to ping me here if you want to continue discussing the plot implementation