WildBench - an allenai Collection

WildBench

updated 18 days ago

AI2 WildBench Leaderboard (V2)

Display and explore model leaderboards and chat history

Note: The leaderboard for visualizing the results and collecting human feedback.

allenai/WildBench

Viewer • Updated Nov 4, 2024 • 2.3k • 1.56k • 35

Note: Examples for evaluating LLMs.

allenai/WildBench-V2-Model-Outputs

Viewer • Updated Aug 1, 2024 • 62.5k • 3.66k • 2

Note: The model outputs for verified LLMs on the leaderboard.

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Paper • 2406.04770 • Published Jun 7, 2024 • 28