puneeshkhanna committed
Commit 8799bc6 · verified · 1 Parent(s): d92e3ec

Add official leaderboard eval comparison

Files changed (1): README.md (+66 -4)
README.md CHANGED
@@ -182,13 +182,75 @@ print(response)
 <br>
 
 ## Benchmarks
-We report in the following table our internal pipeline benchmarks.
+We report the official HuggingFace leaderboard normalized evaluations [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
+<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+<colgroup>
+<col style="width: 10%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+</colgroup>
+<thead>
+<tr>
+<th>Benchmark</th>
+<th>Yi-1.5-9B-Chat</th>
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
+<th>Gemma-2-9b-it</th>
+<th>Falcon3-10B-Instruct</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>IFEval</td>
+<td>60.46</td>
+<td>63.80</td>
+<td>74.36</td>
+<td><b>78.17</b></td>
+</tr>
+<tr>
+<td>BBH (3-shot)</td>
+<td>36.95</td>
+<td>29.68</td>
+<td>42.14</td>
+<td><b>44.82</b></td>
+</tr>
+<tr>
+<td>MATH Lvl-5 (4-shot)</td>
+<td>12.76</td>
+<td>6.50</td>
+<td>0.23</td>
+<td><b>25.91</b></td>
+</tr>
+<tr>
+<td>GPQA (0-shot)</td>
+<td>11.30</td>
+<td>5.37</td>
+<td><b>14.77</b></td>
+<td>10.51</td>
+</tr>
+<tr>
+<td>MUSR (0-shot)</td>
+<td>12.84</td>
+<td>8.48</td>
+<td>9.74</td>
+<td><b>13.61</b></td>
+</tr>
+<tr>
+<td>MMLU-PRO (5-shot)</td>
+<td>33.06</td>
+<td>27.97</td>
+<td>31.95</td>
+<td><b>38.10</b></td>
+</tr>
+</tbody>
+</table>
+
+Also, we report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
 - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
-
-
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
 <colgroup>
 <col style="width: 10%;">
@@ -202,7 +264,7 @@ We report in the following table our internal pipeline benchmarks.
 <th>Category</th>
 <th>Benchmark</th>
 <th>Yi-1.5-9B-Chat</th>
-<th>Mistral-Nemo-Base-2407 (12B)</th>
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
 <th>Falcon3-10B-Instruct</th>
 </tr>
 </thead>
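
A note on the first table added above: the Open LLM Leaderboard reports *normalized* scores, rescaling each raw benchmark score so that the random-guess baseline maps to 0 and a perfect score maps to 100. The sketch below illustrates that rescaling; the helper name `normalize` and the example numbers are ours, not part of the leaderboard's code.

```python
def normalize(raw_pct: float, baseline_pct: float) -> float:
    """Rescale a raw accuracy (in percent) so the random-guess baseline
    maps to 0 and a perfect score maps to 100; scores at or below the
    baseline clamp to 0, per the leaderboard's documented normalization."""
    if raw_pct <= baseline_pct:
        return 0.0
    return 100.0 * (raw_pct - baseline_pct) / (100.0 - baseline_pct)

# GPQA is 4-way multiple choice, so random guessing scores 25%.
# A hypothetical raw accuracy of 32.9% normalizes to roughly 10.5.
print(round(normalize(32.9, 25.0), 2))  # 10.53
```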
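
For the internal-pipeline numbers, the diff's bullet list names the relevant lm-evaluation-harness options (chat template applied, few-shot examples as multi-turn dialogue, a fixed batch size). Below is a minimal reproduction sketch assuming a recent harness version whose `simple_evaluate` exposes those options; the Hub model ID, task list, and batch size are illustrative assumptions, not the authors' exact configuration.

```python
import lm_eval

# Sketch: evaluate with the chat template applied and few-shot examples
# rendered as multi-turn dialogue, mirroring the settings listed in the
# diff. Model ID, task names, and batch size are assumptions; check the
# option and task names against your installed harness version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/Falcon3-10B-Instruct,dtype=bfloat16",
    tasks=["ifeval", "mmlu_pro", "gpqa"],
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    batch_size=16,
)
print(results["results"])
```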