import getDB from "@/utils/getDB"
import Head from "next/head"
import Link from "next/link"
import { useRouter } from "next/router"
import { useEffect, useMemo, useState } from "react"
// import styles from '@/styles/Home.module.css'

// getDB returns the SQLite database produced by the collection script
// (a hedged sketch of the assumed helper and schema is in a comment at the bottom of this file)
export const getStaticProps = async () => {
  const db = await getDB()

  const prompts = await db.all(`SELECT * FROM prompts ORDER BY text ASC`)

  // get all models that have at least 1 result
  const models = await db.all(
    `SELECT * FROM models WHERE id IN (SELECT DISTINCT model FROM results) ORDER BY name ASC`
  )

  return { props: { prompts, models } }
}

export default function Home({ prompts, models }) {
  const router = useRouter()

  // "prompt" or "model"; kept in sync with the ?viewBy= query param
  const [viewBy, setViewBy] = useState(router.query.viewBy || "prompt")

  const changeView = (viewBy) => {
    router.push({ query: { viewBy } })
  }

  useEffect(() => {
    if (router.query.viewBy) setViewBy(router.query.viewBy)
  }, [router.query.viewBy])

  // distinct prompt categories, preserving the order the prompts come back in
  const types = useMemo(() => {
    return Array.from(new Set(prompts.map((p) => p.type)))
  }, [prompts])

  return (
    <>
      <Head>
        <title>LLM Benchmarks</title>
      </Head>

      <h1>Crowdsourced LLM Benchmark</h1>


      <p>
        Benchmarks like HellaSwag are a bit too abstract for me to get a sense
        of how well models actually perform in real-world workflows.
      </p>


      <p>
        I had the idea of writing a script that runs prompts testing basic
        reasoning, instruction following, and creativity against around 60
        models that I could get my hands on through inference APIs.
      </p>
      {/* a hedged sketch of such a script is in a comment at the bottom of this file */}


      <p>
        The script stored all the answers in a SQLite database, and those are
        the raw results.
      </p>



      <p>
        {`view: `}
        <a onClick={() => changeView("model")}>models</a>{" "}
        / <a onClick={() => changeView("prompt")}>prompts</a>
      </p>


      {viewBy === "prompt" ? (
        <>
          {types.map((type, k) => (
            <div key={k}>
              <p>{type}:</p>
              <ul>
                {/* the link target below is an assumption; point it at the real prompt route */}
                {prompts
                  .filter((p) => p.type === type)
                  .map((prompt) => (
                    <li key={prompt.id}>
                      <Link href={`/prompts/${prompt.id}`}>{prompt.text}</Link>
                    </li>
                  ))}
              </ul>
            </div>
          ))}
        </>
      ) : (
        <ul>
          {/* the link target below is an assumption; point it at the real model route */}
          {models.map((model) => (
            <li key={model.id}>
              <Link href={`/models/${model.id}`}>{model.name}</Link>
            </li>
          ))}
        </ul>
      )}

      <h3>Notes</h3>




      <p>
        Edit: as this got popular, I added an email form so you can get
        notified of future benchmark results:
      </p>
      {/* email signup form */}


      <p>(no spam, max 1 email per month)</p>

    </>
  )
}
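
/*
  A minimal sketch of the getDB helper imported at the top of this file, which
  lives in @/utils/getDB and is not shown here. It assumes the `sqlite` and
  `sqlite3` npm packages and a local `benchmark.db` file; both the package
  choice and the filename are assumptions.

    import sqlite3 from "sqlite3"
    import { open } from "sqlite"

    let db
    export default async function getDB() {
      // reuse a single connection across getStaticProps calls
      if (!db) {
        db = await open({ filename: "./benchmark.db", driver: sqlite3.Database })
      }
      return db
    }

  The queries in getStaticProps only require a schema roughly like the one
  below. Columns other than prompts.text, prompts.type, models.id, models.name
  and results.model are assumptions.

    CREATE TABLE prompts (id INTEGER PRIMARY KEY, type TEXT, text TEXT);
    CREATE TABLE models  (id INTEGER PRIMARY KEY, name TEXT, api_id TEXT);
    CREATE TABLE results (
      id INTEGER PRIMARY KEY,
      model INTEGER REFERENCES models (id),
      prompt INTEGER REFERENCES prompts (id),
      response TEXT
    );
*/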
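
/*
  A rough sketch of the collection script described in the page copy: it runs
  every prompt against every model through an inference API and stores the
  answers in the same SQLite database. The OpenAI-style endpoint, the
  OPENAI_API_KEY env var and the models.api_id column are assumptions; the
  real script most likely talks to several different providers. Run as an ES
  module (e.g. node collect.mjs) so top-level await works.

    import sqlite3 from "sqlite3"
    import { open } from "sqlite"

    async function ask(model, prompt) {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model,
          messages: [{ role: "user", content: prompt.text }],
        }),
      })
      const json = await res.json()
      return json.choices[0].message.content
    }

    const db = await open({ filename: "./benchmark.db", driver: sqlite3.Database })
    const prompts = await db.all(`SELECT * FROM prompts`)
    const models = await db.all(`SELECT * FROM models`)

    for (const model of models) {
      for (const prompt of prompts) {
        const answer = await ask(model.api_id, prompt)
        // store the raw answer so the site can render it as-is
        await db.run(
          `INSERT INTO results (model, prompt, response) VALUES (?, ?, ?)`,
          model.id,
          prompt.id,
          answer
        )
      }
    }
*/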