MTEB Leaderboard: User guide and best practices
Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, Gabriel Sequeira, and Wissam Siblini
MTEB [1] is a multi-task and multi-language comparison of embedding models. It comes in the form of a leaderboard, based on multiple scores, with only one model standing at the top! Does that make it easy to choose the right model for your application? You wish! This guide provides tips on how to make clever use of MTEB. As our team worked on making the French benchmark available [2], the examples rely on the French MTEB; nonetheless, the tips apply to the entire benchmark.
Don't let all those scores fool you
MTEB is a leaderboard. It shows you scores. What it doesn't show you? Significance.
While it is a great resource for discovering and comparing models, MTEB might not be as straightforward to read as one might expect. As of today (1st of March 2024), many SOTA models have been evaluated, and most of them display close average scores. For the French MTEB, these averages are computed over 26 different tasks (56 for the English MTEB!) and come with no standard deviation. Even though the top model looks better than the others, the score difference with the models that follow might not be significant. You can directly download the raw results to compute statistical metrics. As an example, we performed critical difference tests and found that, at a significance level of 0.05, the current 9 top models of the French MTEB leaderboard are statistically equivalent. Even more datasets would be required to perceive statistical significance.
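To make this concrete, here is a minimal sketch of such a check, assuming you have exported a per-task score table from the leaderboard (the file name and layout are illustrative, not part of MTEB itself). It uses scipy's Friedman test plus a pairwise Wilcoxon signed-rank test as a lightweight stand-in for a full critical difference analysis.

```python
# Minimal sketch: check whether top models' per-task scores differ significantly.
# Assumes a CSV exported from the leaderboard with rows = tasks, columns = models.
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon

scores = pd.read_csv("mteb_fr_scores.csv", index_col=0)  # tasks x models

# Friedman test: are the models' rankings across tasks distinguishable at all?
stat, p_value = friedmanchisquare(*[scores[m] for m in scores.columns])
print(f"Friedman test: statistic={stat:.2f}, p-value={p_value:.3f}")

# Pairwise comparison between the two best-ranked models.
top_two = scores.mean().sort_values(ascending=False).index[:2]
stat, p_value = wilcoxon(scores[top_two[0]], scores[top_two[1]])
print(f"{top_two[0]} vs {top_two[1]}: p-value={p_value:.3f}")
```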
Dive into data
Do not just look at the average scores of models on the task you are interested in. Instead, look at the individual scores on the datasets that best represent your use case.
Let's say you are interested in retrieval for an application in the legal domain. Legal language is very specific, and models trained on general-purpose data might perform poorly in your application. Consequently, instead of just taking the best MTEB model for the retrieval task, look specifically at the tasks whose datasets contain legal-related content. The scores on datasets such as Syntec or BSARD [3] might be more representative of the models' performance on your data. There is no need for a deep dive into the data itself: most of the time, the dataset card and a short preview will tell you everything you need, as in the sketch below.
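For instance, a quick preview with the datasets library is often enough. The repository id and configuration name below are examples taken from the Hub and may change, so copy the exact identifiers from the dataset card of the task you care about.

```python
from datasets import load_dataset

# Repository id and config are illustrative; use the exact ones from the
# dataset card of the task you are interested in.
corpus = load_dataset("lyon-nlp/mteb-fr-retrieval-syntec-s2p", "documents")
print(corpus)                          # available splits and columns
first_split = next(iter(corpus.values()))
print(first_split[0])                  # peek at one example
```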
Consider the model's characteristics
Using the model that displays the best average score for your application might be tempting. However, a model comes with characteristics that translate into usage constraints. Make sure those constraints match yours.
Before going for the top-of-the-list model, check its characteristics: how many tokens does its context window handle? What is the size of the model? Has it been trained on multilingual data? Those features may conflict with the constraints of your application. For example, here are a few things you might want to consider (a short inspection sketch follows the list):
- compute capacity: if you need to run experiments on a single-GPU laptop, do not select 7B-parameter models; smaller models might be a better fit.
- input token sequence length: if you need to perform sentence similarity on short sentences, there is no need for a 32k-token context window; the 128-token window of sentence-camembert-large might be enough.
- storage capacity/latency: if you hesitate between the base and large flavours of multilingual-e5 [4], ask yourself whether the 1-point difference in the average retrieval score is worth going from a 1.1 GB to a 2.2 GB model and from a 768- to a 1024-dimensional output embedding.
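As an illustration, here is a small sketch that prints those characteristics for sentence-transformers models. The model ids are examples; swap in your own candidates.

```python
# Compare practical characteristics of candidate models with sentence-transformers.
from sentence_transformers import SentenceTransformer

for model_id in ["intfloat/multilingual-e5-base", "intfloat/multilingual-e5-large"]:
    model = SentenceTransformer(model_id)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{model_id}: "
        f"max_seq_length={model.max_seq_length}, "
        f"embedding_dim={model.get_sentence_embedding_dimension()}, "
        f"params={n_params / 1e6:.0f}M"
    )
```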
Do not forget MTEB is a leaderboard...
And as leaderboards sometimes do, it can encourage competing without following the rules.
Indeed, keep in mind that many providers want to see their model at the top of that list and, since the benchmark relies on public datasets, malpractice such as data leakage or overfitting on the evaluation data could bias the MTEB evaluation. So if you find a model that seems good to you, look further into its development and training setup: many contributors make the effort to provide a detailed model card specifying the model's training data, purpose or licence. Look into those details, pick the model you are interested in along with 2 or 3 secondary choices, and run some tests on your own data to check that it still fits.
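If you want to inspect model cards programmatically, a simple option is the ModelCard helper from huggingface_hub, sketched below. The model id is an example, and the metadata fields shown are only present when the authors declared them.

```python
# Minimal sketch: read a model card's declared metadata with huggingface_hub.
from huggingface_hub import ModelCard

card = ModelCard.load("intfloat/multilingual-e5-base")
metadata = card.data.to_dict()
print(metadata.get("license"))   # licence declared by the authors, if any
print(metadata.get("datasets"))  # training datasets, if declared
print(metadata.get("language"))  # declared language coverage
```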
How to evaluate your model?
You might want to know how your custom model performs against other models in the leaderboard.
It's easy! Just follow the steps described in this blog. By running them, you should get your model's performance on each task, as in the sketch below. And if you want, feel free to add the results to your model card so that they appear on the leaderboard.
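As a rough sketch of what this looks like with a sentence-transformers model (the model id is hypothetical, and the task selection and output folder are examples):

```python
# Minimal sketch: evaluate your own model on a subset of MTEB tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "my-org/my-custom-embedding-model"  # hypothetical model id
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["SyntecRetrieval"])     # or task_langs=["fr"] for all French tasks
results = evaluation.run(model, output_folder=f"results/{model_name}")
```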
Thank you for reading and happy MTEB browsing! 🤗
Bibliography
[1] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics.
[2] https://huggingface.co./blog/lyon-nlp-group/french-mteb-datasets
[3] Antoine Louis and Gerasimos Spanakis. 2022. A Statutory Article Retrieval Dataset in French. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6789–6803, Dublin, Ireland. Association for Computational Linguistics.
[4] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.