puneeshkhanna
committed on
Add official leaderboard eval comparison
README.md
CHANGED
@@ -182,13 +182,75 @@ print(response)
 <br>
 
 ## Benchmarks
-We report in the following table our internal pipeline benchmarks.
+We report the official HuggingFace leaderboard normalized evaluations [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
+<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+<colgroup>
+<col style="width: 10%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+</colgroup>
+<thead>
+<tr>
+<th>Benchmark</th>
+<th>Yi-1.5-9B-Chat</th>
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
+<th>Gemma-2-9b-it</th>
+<th>Falcon3-10B-Instruct</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>IFEval</td>
+<td>60.46</td>
+<td>63.80</td>
+<td>74.36</td>
+<td><b>78.17</b></td>
+</tr>
+<tr>
+<td>BBH (3-shot)</td>
+<td>36.95</td>
+<td>29.68</td>
+<td>42.14</td>
+<td><b>44.82</b></td>
+</tr>
+<tr>
+<td>MATH Lvl-5 (4-shot)</td>
+<td>12.76</td>
+<td>6.50</td>
+<td>0.23</td>
+<td><b>25.91</b></td>
+</tr>
+<tr>
+<td>GPQA (0-shot)</td>
+<td>11.30</td>
+<td>5.37</td>
+<td><b>14.77</b></td>
+<td>10.51</td>
+</tr>
+<tr>
+<td>MUSR (0-shot)</td>
+<td>12.84</td>
+<td>8.48</td>
+<td>9.74</td>
+<td><b>13.61</b></td>
+</tr>
+<tr>
+<td>MMLU-PRO (5-shot)</td>
+<td>33.06</td>
+<td>27.97</td>
+<td>31.95</td>
+<td><b>38.10</b></td>
+</tr>
+</tbody>
+</table>
+
+Also, we report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
 - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
-
-
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
 <colgroup>
 <col style="width: 10%;">
@@ -202,7 +264,7 @@ We report in the following table our internal pipeline benchmarks.
 <th>Category</th>
 <th>Benchmark</th>
 <th>Yi-1.5-9B-Chat</th>
-<th>Mistral-Nemo-
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
 <th>Falcon3-10B-Instruct</th>
 </tr>
 </thead>
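
The internal-pipeline bullets in the diff above (chat template, fewshot_as_multiturn, same batch size across models) describe an lm-evaluation-harness run. A minimal sketch of how such a run might be reproduced is shown below; the task list, batch size, dtype, and output path are illustrative assumptions, not the exact configuration used, and flag availability can vary with the harness version.

```python
# Sketch: reproducing the "raw scores" setup described in the README bullets
# with EleutherAI's lm-evaluation-harness CLI (pip install lm-eval).
# Tasks, batch size, dtype, and output path below are assumptions.
import subprocess

cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=tiiuae/Falcon3-10B-Instruct,dtype=bfloat16",
    "--tasks", "mmlu,gsm8k",            # assumed task list for illustration
    "--apply_chat_template",            # apply the model's chat template
    "--fewshot_as_multiturn",           # present few-shot examples as extra dialogue turns
    "--batch_size", "8",                # keep the same batch size across all compared models
    "--output_path", "results/falcon3-10b-instruct",
]
subprocess.run(cmd, check=True)
```

The same command, with only `--model_args` changed, would then be repeated for each baseline model so the scores stay comparable.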