@lbourdois on Hugging Face: "Let me introduce you LLE: Leaks, leaks everywhere! A quick experiment I've…"

lbourdois

posted an update Feb 1, 2024

Let me introduce you to LLE: Leaks, leaks everywhere!

This is a quick experiment I carried out on around 600 datasets from the HF Hub. The results are stored in lbourdois/LLE, and the methodology is described in
https://huggingface.co./blog/lbourdois/lle

tomaarsen

Feb 1, 2024

I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.

lbourdois

Feb 1, 2024

It's the exchanges I've had with you that have led me to question the quality of the data 🤗

On which desk in the Paris office should I leave a post-it note asking for the creation of the bot?

dhuynh95

Feb 1, 2024

Pretty cool stuff! Maybe you should do a leaderboard of major datasets and their leakage score

lastrosade

Feb 7, 2024 • edited Feb 7, 2024

A little glossary would be nice. I'm not even sure what NER is or what a "leak" means.

lbourdois

Feb 9, 2024

For NER (Named Entity Recognition), you can consult https://huggingface.co./tasks/token-classification.
A leak is when data from the train split is also found in the test split, which biases the results and benchmarks.
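
To make the definition concrete, here is a minimal sketch of how a leak rate could be measured. It is only an illustration, not necessarily the exact methodology from the blog post: it assumes each split is a plain list of text strings and counts exact matches only. The helper name `leak_rate` and the toy sentences are made up for this example.

```python
def leak_rate(train_texts, test_texts):
    """Return the fraction of test examples whose text also appears in train.

    Hypothetical helper for illustration: exact string matches only.
    """
    if not test_texts:
        return 0.0
    train_set = set(train_texts)  # a set makes each membership check fast
    leaked = sum(1 for text in test_texts if text in train_set)
    return leaked / len(test_texts)


# Toy data (invented): one of the three test sentences is also in train.
train = ["Paris is in France.", "Berlin is in Germany.", "Rome is in Italy."]
test = ["Rome is in Italy.", "Madrid is in Spain.", "Lisbon is in Portugal."]

print(f"{leak_rate(train, test):.1%} of the test split also appears in train")
```

A model evaluated on this toy test split would already have seen a third of it during training, so its score would look better than it should.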
