leafspark committed
Commit
ca539f3
1 Parent(s): 1dc1fab

add dataset card

Files changed (1)
  1. index.html +31 -0
index.html CHANGED
@@ -54,6 +54,30 @@
 
  <script>
  const markdown = `
+ BaseBench: A Foundational Language Model Evaluation Framework
+
+ Description:
+ BaseBench is a targeted evaluation framework designed to assess the fundamental capabilities of large language models across a spectrum of basic yet crucial tasks. The suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.
+
+ **Features**:
+
+ 1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes such as Base64 and ROT13, evaluating its understanding of data representation and transformation.
+
+ 2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic operations and mathematical problem-solving, gauging its numerical processing capabilities.
+
+ 3. Linguistic Analysis: Examines the model's grasp of fundamental language properties such as character counting and frequency analysis, probing its understanding of word structure and composition.
+
+ 4. Error Detection and Correction: Challenges the model to identify and rectify typographical errors, testing its language pattern recognition and error-handling abilities at the tokenization level.
+
+ **Purpose**:
+ BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:
+
+ 1. A standardized baseline for comparing different models or versions.
+ 2. Insight into a model's fundamental processing capabilities.
+ 3. A tool for identifying potential gaps in basic language and data handling skills.
+ 4. A means to track incremental improvements in core model competencies.
+ 5. Sufficient difficulty to avoid benchmark saturation.
+
  | Rank | Model | Accuracy | Time | Speed |
  |------|------------------------------------|--------------------|-------|-----------|
  | 1 | openai/gpt-4o | 59.00% (1475/2500) | 03:17 | 12.66it/s |
@@ -67,6 +91,13 @@
  | 9 | 01-ai/yi-large | 20.68% (517/2500) | 02:37 | 15.83it/s |
  | 10 | mistralai/mixtral-8x22b-instruct | 19.60% (490/2500) | 04:32 | 9.18it/s |
  | 11 | meta-llama/llama-3.1-70b-instruct | 19.04% (476/2500) | 18:01 | 2.31it/s |
+
+ **Insights**:
+ - GPT models lead (only Anthropic's flagship manages to beat 4o-mini)
+ - Mistral Large is an outlier: it beats GPT-4o-mini easily (consistent with its MMLU-Pro score)
+ - Llama models score fairly low
+ - Closed-source/proprietary models (e.g. Mistral Large) tend to score better, possibly due to training differences
+ - Gemini is fast, but its quality is comparable to Gemma's
  `;
 
  document.addEventListener('DOMContentLoaded', function() {
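
The card names the task categories but does not show the item format or scorer. As a rough illustration only, here is a minimal sketch of how BaseBench-style encoding and counting items might be generated and scored; every name in it is hypothetical (nothing below is taken from this commit), and exact-match scoring is an assumption suggested by the "1475/2500"-style accuracy counts:

```js
// Hypothetical sketch only: the actual BaseBench task format and scorer are
// not part of this commit. Assumes fixed-answer items with exact matching.
const plaintext = "hello world";

// Base64 decoding task (btoa is the browser's built-in Base64 encoder).
const base64Task = {
  prompt: `Decode this Base64 string; reply with only the result: ${btoa(plaintext)}`,
  answer: plaintext,
};

// ROT13 decoding task. ROT13 is its own inverse, so encoding the expected
// answer yields the prompt payload.
const rot13 = (s) =>
  s.replace(/[a-z]/gi, (c) => {
    const base = c <= "Z" ? 65 : 97; // uppercase vs. lowercase alphabet start
    return String.fromCharCode(((c.charCodeAt(0) - base + 13) % 26) + base);
  });
const rot13Task = {
  prompt: `Decode this ROT13 string; reply with only the result: ${rot13(plaintext)}`,
  answer: plaintext,
};

// Character-counting task, in the spirit of the "Linguistic Analysis" category.
const countTask = {
  prompt: `How many times does "l" appear in "${plaintext}"? Reply with a number.`,
  answer: String([...plaintext].filter((c) => c === "l").length),
};

// Exact-match scoring; accuracy over 2500 such items would produce the
// "correct/2500" counts shown in the table.
const score = (reply, task) => (reply.trim() === task.answer ? 1 : 0);
```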