Spaces: leafspark/BaseBench

leafspark committed on Jul 27

Commit ca539f3 • 1 Parent(s): 1dc1fab

add dataset card

Files changed (1)

index.html +31 -0

@@ -54,6 +54,30 @@
     <script>
         const markdown = `
+BaseBench: A Foundational Language Model Evaluation Framework
+**Description**:
+BaseBench is a targeted evaluation framework that assesses the fundamental capabilities of large language models on basic but crucial tasks. The suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.
+**Features**:
+1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes such as Base64 and ROT13, evaluating its understanding of data representation and transformation.
+2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic and solve basic mathematical problems, gauging its numerical processing capabilities.
+3. Linguistic Analysis: Examines the model's grasp of fundamental language properties through character counting and frequency analysis, probing its understanding of word structure and composition.
+4. Error Detection and Correction: Challenges the model to identify and correct typographical errors, testing its language pattern recognition and error-handling abilities (a tokenization-sensitive task).
+**Purpose**:
+BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:
+1. A standardized baseline for comparing different models or versions.
+2. Insight into a model's fundamental processing capabilities.
+3. A tool for identifying potential gaps in basic language and data handling skills.
+4. A means to track incremental improvements in core model competencies.
+5. Enough difficulty to avoid saturation.
 | Rank | Model                              | Accuracy           | Time  | Speed     |
 |------|------------------------------------|--------------------|-------|-----------|
 | 1    | openai/gpt-4o                      | 59.00% (1475/2500) | 03:17 | 12.66it/s |
@@ -67,6 +91,13 @@
 | 9    | 01-ai/yi-large                     | 20.68% (517/2500)  | 02:37 | 15.83it/s |
 | 10   | mistralai/mixtral-8x22b-instruct   | 19.60% (490/2500)  | 04:32 | 9.18it/s  |
 | 11   | meta-llama/llama-3.1-70b-instruct  | 19.04% (476/2500)  | 18:01 | 2.31it/s  |
+**Insights**:
+- GPT models lead; only Anthropic's flagship model manages to beat GPT-4o-mini.
+- Mistral Large is an outlier: it beats GPT-4o-mini easily, which is consistent with its MMLU-Pro score.
+- Llama models score fairly low (llama-3.1-70b-instruct ranks 11th at 19.04%).
+- Closed-source (proprietary) models tend to score better (e.g., Mistral Large), possibly due to training differences.
+- Gemini is fast, but its accuracy is comparable to Gemma's.
         `;
         document.addEventListener('DOMContentLoaded', function() {
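
Reviewer note on the card's Features list and the Accuracy column: the sketch below shows, in plain JavaScript, how tasks of the kinds the card names (ROT13, Base64, character counting) could be produced and scored. It is not taken from the BaseBench source. `rot13`, `toBase64`, `countChar`, `countCorrect`, and `formatAccuracy` are hypothetical helper names, exact-match scoring after trimming whitespace is an assumption, and `Buffer` assumes a Node.js runtime rather than the browser page in `index.html`.

```javascript
// Hypothetical helpers for illustration only; not part of this commit.

// ROT13: rotate each ASCII letter 13 places, leave other characters untouched.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= 'Z' ? 65 : 97; // 'A' for uppercase, 'a' for lowercase
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Base64-encode a UTF-8 string (assumes Node.js, where Buffer is a global).
function toBase64(text) {
  return Buffer.from(text, 'utf8').toString('base64');
}

// Count how often a single character occurs in a word.
function countChar(word, ch) {
  return [...word].filter((c) => c === ch).length;
}

// Assumed scoring rule: a prediction counts only if it matches the
// reference exactly after surrounding whitespace is trimmed.
function countCorrect(predictions, references) {
  let correct = 0;
  references.forEach((ref, i) => {
    if ((predictions[i] ?? '').trim() === ref) correct += 1;
  });
  return correct;
}

// Format a score in the "percent (correct/total)" style used by the table.
function formatAccuracy(correct, total) {
  return `${((100 * correct) / total).toFixed(2)}% (${correct}/${total})`;
}
```

For example, `formatAccuracy(1475, 2500)` returns `59.00% (1475/2500)`, the form shown in the first table row.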