How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, distilabel, and 🌤️Lighteval to generate an evaluation dataset and evaluate models.
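A minimal sketch of the dataset-generation step with distilabel (v1.x API assumed); the seed prompts, model id, and repo name are placeholders, and the post pairs this with Argilla for human review and Lighteval for running the evaluation:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Domain-specific seed instructions (finance used purely as an example).
seed_data = [
    {"instruction": "Explain the difference between EBITDA and net income."},
    {"instruction": "Summarize the key risks disclosed in a 10-K filing."},
]

with Pipeline(name="domain-eval-generation") as pipeline:
    load = LoadDataFromDicts(data=seed_data)
    generate = TextGeneration(
        # Any hosted model works here; this id is just an illustration.
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run()
    # Push the generated pairs to the Hub, review them in Argilla, then point
    # Lighteval at the curated dataset as a custom task.
    distiset.push_to_hub("my-org/finance-eval-seed")  # hypothetical repo id
```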
🚨 How green is your model? 🌱 Introducing a new feature in the Comparator tool: Environmental Impact for responsible #LLM research! 👉 open-llm-leaderboard/comparator Now you can compare models not only by performance but also by their environmental footprint!
🌍 The Comparator calculates CO₂ emissions during evaluation and shows key model characteristics: evaluation score, number of parameters, architecture, precision, type... 🛠️ Make informed decisions about your model's impact on the planet and join the movement towards greener AI!
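The leaderboard has its own measurement pipeline, but the underlying idea of tracking CO₂ during an evaluation run can be illustrated with the codecarbon library; the project name and the evaluation function below are placeholders:

```python
from codecarbon import EmissionsTracker

def run_evaluation():
    # Placeholder for your actual evaluation loop (e.g. a Lighteval run).
    ...

tracker = EmissionsTracker(project_name="llm-eval")  # name is illustrative
tracker.start()
try:
    run_evaluation()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2-eq

print(f"Estimated emissions: {emissions_kg:.4f} kg CO2-eq")
```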
🚀 New feature of the Comparator of the 🤗 Open LLM Leaderboard: now compare models with their base versions & derivatives (finetunes, adapters, etc.). Perfect for tracking how adjustments affect performance & seeing innovations in action. Dive deeper into the leaderboard!
🛠️ Here's how to use it: 1. Select your model from the leaderboard. 2. Load its model tree. 3. Choose any base & derived models (adapters, finetunes, merges, quantizations) for comparison. 4. Press Load. See side-by-side performance metrics instantly!
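The "model tree" the Comparator loads can also be explored programmatically; a small sketch with huggingface_hub, assuming the derivative repos declare their base model via the base_model metadata tag (the exact filter string is an assumption):

```python
from huggingface_hub import HfApi

api = HfApi()
base = "meta-llama/Llama-3.1-70B"  # example base model

# List repos that tag this model as their base, most downloaded first.
derivatives = api.list_models(
    filter=f"base_model:{base}", sort="downloads", direction=-1, limit=10
)
for model in derivatives:
    print(model.id)
```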
Ready to dive in? 🏆 Try the 🤗 Open LLM Leaderboard Comparator now! See how models stack up against their base versions and derivatives to understand fine-tuning and other adjustments. Easier model analysis for better insights! Check it out here: open-llm-leaderboard/comparator 🌐
Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
🚨 Instruct-tuning impacts models differently across families! Qwen2.5-72B-Instruct excels on IFEval but struggles with MATH-Hard, while Llama-3.1-70B-Instruct avoids the MATH performance loss! Why? Can they follow the output format shown in the few-shot examples? 📊 Compare models: open-llm-leaderboard/comparator
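To reproduce this kind of instruct-vs-base comparison outside the UI, one option is to pull the published leaderboard results into pandas; the dataset id and column names below are assumptions, so check the actual schema before relying on them:

```python
from datasets import load_dataset

# Leaderboard scores are published as a Hub dataset (repo id assumed here).
df = load_dataset("open-llm-leaderboard/contents", split="train").to_pandas()

models = ["Qwen/Qwen2.5-72B-Instruct", "Qwen/Qwen2.5-72B"]
pair = df[df["fullname"].isin(models)]  # "fullname" column name is assumed

# Keep only the benchmark columns that actually exist in the schema.
score_cols = [c for c in pair.columns if c in ("IFEval", "MATH Lvl 5", "BBH", "MMLU-PRO")]
print(pair.set_index("fullname")[score_cols])
```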