Leader Board Test Org

AI & ML interests

None defined yet.

Recent Activity

leaderboard-test's activity

freddyaboulton 
posted an update about 1 month ago
Version 0.0.21 of gradio-pdf now properly loads chinese characters!
freddyaboulton 
posted an update about 1 month ago
Hello Llama 3.2! 🗣️🦙

Build a Siri-like coding assistant that responds to "Hello Llama" in 100 lines of python! All with Gradio, webRTC 😎

freddyaboulton/hey-llama-code-editor
freddyaboulton 
posted an update about 2 months ago
freddyaboulton 
posted an update 7 months ago
clefourrier 
posted an update 9 months ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co./blog/leaderboard-medicalllm
clefourrier 
posted an update 9 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

With this feature, you can average model scores over only the problems published after a model's training data was collected. This means... contamination-free code evals! 🚀
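
To make the idea concrete, here is a minimal sketch of date-based problem selection. It is not LiveCodeBench's code: the record layout, the `score_after` helper, and the sample data are all hypothetical, chosen only to illustrate averaging over problems newer than a cutoff date.

```python
from datetime import date

# Hypothetical per-problem results: (problem_id, publication_date, passed)
results = [
    ("p1", date(2023, 5, 1), True),
    ("p2", date(2023, 9, 15), True),
    ("p3", date(2024, 1, 10), False),
    ("p4", date(2024, 2, 20), True),
]


def score_after(results, cutoff):
    """Pass rate over problems published strictly after `cutoff`.

    Returns None when no problem is newer than the cutoff.
    """
    fresh = [passed for _, published, passed in results if published > cutoff]
    if not fresh:
        return None
    return sum(fresh) / len(fresh)


# Score a model whose (assumed) training cutoff is the end of 2023
recent_score = score_after(results, date(2023, 12, 31))
```

Choosing the cutoff per model (its training-data date) is what keeps the comparison free of problems the model may have seen.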

Check it out!

Blog: https://huggingface.co./blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
clefourrier 
posted an update 9 months ago
🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard
freddyaboulton 
posted an update 10 months ago
We just released gradio version 4.26.0! We *highly* recommend you upgrade your apps to this version to bring in these nice changes:

🎥 Introducing the API recorder. Any gradio app running 4.26.0 and above will have an "API Recorder" that will record your interactions with the app and auto-generate the corresponding python or js code needed to recreate those actions programmatically. It's very neat!

📝 Enhanced markdown rendering in gr.Chatbot

🐢 Fix for slow load times on spaces as well as the UI locking up on rapid generations

See the full changelog of goodies here: https://www.gradio.app/changelog#4-26-0
clefourrier 
posted an update 10 months ago
Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from "Prompt question?" to "Question: prompt question?\nChoices: enumeration of all choices\nAnswer: "), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format is on the x axis. All these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
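
For illustration, the two ends of that prompt-format range could be rendered as below. This is only a sketch under assumptions: the function names, the intermediate format, and the "A. / B." enumeration style are hypothetical, not the formats used in the experiment.

```python
def bare(question, choices):
    """Simplest format: the question alone."""
    return question


def question_answer(question, choices):
    """Question with an explicit answer cue, no choices listed."""
    return f"Question: {question}\nAnswer:"


def question_choices_answer(question, choices):
    """Richest format: question, enumerated choices, then the answer cue."""
    listed = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCDEFGH", choices)
    )
    return f"Question: {question}\nChoices:\n{listed}\nAnswer:"


PROMPT_FORMATS = {
    "bare": bare,
    "question_answer": question_answer,
    "question_choices_answer": question_choices_answer,
}


def render_all(question, choices):
    """Render one sample in every format, keyed by format name."""
    return {name: fmt(question, choices) for name, fmt in PROMPT_FORMATS.items()}


prompts = render_all("What is 2 + 2?", ["3", "4", "5", "22"])
```

Reporting which of these renderings was used, alongside the score, is what makes a result comparable.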
freddyaboulton 
posted an update 10 months ago
Gradio 4.25.0 is out with some nice improvements and bug fixes!

🧹 Automatic deletion of gr.State variables stored in the server. Never run out of RAM again. Also adds an unload event you can run when a user closes their browser tab.

😴 Lazy example caching. You can set cache_examples="lazy" to cache examples when they're first requested as opposed to before the server launches. This can cut down the server's start-up time drastically.

🔊 Fixes a bug with streaming audio outputs

🤖 Improvements to gr.ChatInterface like pasting images directly from the clipboard.

See the rest of the changelog here: https://www.gradio.app/changelog#4-25-0
clefourrier 
posted an update 10 months ago
Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.
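
A minimal sketch of that setup, assuming the few-shot samples are plain strings joined by a separator (the `build_prompt` and `shuffled_order` helpers are hypothetical, not part of any evaluation suite):

```python
import random


def build_prompt(samples, order, current, sep=" "):
    """Join the few-shot samples in the given order, then append the current sample."""
    return sep.join([samples[i] for i in order] + [current])


def shuffled_order(n, seed):
    """A reproducible permutation of range(n); the seed pins the few-shot order."""
    return random.Random(seed).sample(range(n), n)


samples = ["A", "B", "C", "D", "E"]

# Same samples, same current item, only the order differs between runs
run_1 = build_prompt(samples, [0, 1, 2, 3, 4], "Current sample")
run_2 = build_prompt(samples, [3, 2, 4, 0, 1], "Current sample")
```

Logging the seed (or the order itself) with each run is enough to make the few-shot arrangement reproducible.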

As you can see in the attached picture, you get a difference of up to 3 points between the two shufflings of the few-shot samples.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or coms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.
clefourrier 
posted an update 10 months ago
Are you looking for the perfect leaderboard/arena for your use case? 👀

There's a new tool for this!
https://huggingface.co./spaces/leaderboards/LeaderboardFinder

Select your modality, language, task... then search! 🔍
Some categories of interest:
- does the leaderboard accept submissions?
- is the test set private or public?
- is it using an automatic metric, human evaluators, or llm as a judge?

The spaces list is built from space metadata, and reloaded every hour.

Enjoy!
clefourrier 
posted an update 10 months ago
How talkative is your chatbot about your internal data? 😬

As more chatbots get deployed in production, with access to internal databases, we need to make sure they don't leak private information to anyone interacting with them.

The Lighthouz AI team therefore introduced the Chatbot Guardrails Arena to stress test models and see how well guarded your private information is.
Anyone can try to make models reveal information they should not share 😈
(which is quite fun to do for the strongest models)!

The votes will then be gathered to create an Elo ranking of the safest models with respect to PII.
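
As a rough illustration of how pairwise votes become an Elo ranking, here is a sketch of the standard Elo update. The arena's actual computation is not described here, so the starting rating (1000), the K-factor (32), and the `elo_from_votes` helper are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_from_votes(votes, start=1000.0, k=32.0):
    """Apply Elo updates for a sequence of (winner, loser) votes.

    Here "winner" means the model voted as safer in that comparison.
    """
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes between three placeholder model names
ratings = elo_from_votes([("model_a", "model_b"), ("model_a", "model_c")])
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update moves the same number of points from loser to winner, the total rating across models stays constant.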

In the future, with the support of the community, this arena could inform the safety choices that companies make when choosing models and guardrails based on their resistance to adversarial attacks.
It's also a good way to easily demonstrate the limitations of current systems!

Check out the arena: lighthouzai/guardrails-arena
Learn more in the blog: https://huggingface.co./blog/arena-lighthouz
freddyaboulton 
posted an update 10 months ago
Tips for saving disk space with Gradio 💾

Try these out with gradio 4.22.0 ! Code snippet attached.

1. Set delete_cache. The delete_cache parameter will periodically delete files from gradio's cache that are older than a given age. Setting it will also delete all files created by that app when the app shuts down. It is a tuple of two ints, (frequency, age), expressed in seconds. So delete_cache=(3600, 3600) will delete files older than an hour every hour.

2. Use static files. Static files are not copied to the cache and are instead served directly to users of your app. This is useful for components displaying a lot of content that won't change, like a gallery with hundreds of images.

3. Set format="jpeg" for images and galleries. JPEGs take up less disk space than PNGs. This can also speed up your prediction function, as the images will be written to the cache faster.
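
To illustrate what the (frequency, age) pair in tip 1 means, here is a standard-library sketch of a single cleanup pass. It is not gradio's implementation: `purge_old_files` is a hypothetical helper, and measuring age by file modification time is an assumption.

```python
import time
from pathlib import Path


def purge_old_files(cache_dir, max_age_seconds, now=None):
    """Delete files under `cache_dir` older than `max_age_seconds`.

    Age is measured from the file's modification time (an assumption).
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletion does not disturb iteration
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

With delete_cache=(3600, 3600), a pass like this with max_age_seconds=3600 would run once every 3600 seconds.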

clefourrier 
posted an update 11 months ago
🔥 New multimodal leaderboard on the hub: ConTextual!

Many situations require models to parse images containing text: maps, web pages, real world pictures, memes, ... 🖼️
So how do you evaluate performance on this task?

The ConTextual team introduced a brand new dataset of instructions and images, to test LMM (large multimodal models) reasoning capabilities, and an associated leaderboard (with a private test set).

This is super exciting imo because it has the potential to be a good benchmark both for multimodal models and for assistants' vision capabilities, thanks to the instructions in the dataset.

Congrats to @rohan598 , @hbXNov , @kaiweichang and @violetpeng !!

Learn more in the blog: https://huggingface.co./blog/leaderboard-contextual
Leaderboard: ucla-contextual/contextual_leaderboard