puneeshkhanna
committed on
Add official leaderboard eval comparison
README.md
CHANGED
@@ -182,13 +182,75 @@ print(response)
 <br>
 
 ## Benchmarks
-We report in the following table our internal pipeline benchmarks.
+We report the official HuggingFace leaderboard normalized evaluations [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
+<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+<colgroup>
+<col style="width: 10%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+</colgroup>
+<thead>
+<tr>
+<th>Benchmark</th>
+<th>Yi-1.5-9B-Chat</th>
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
+<th>Gemma-2-9b-it</th>
+<th>Falcon3-10B-Instruct</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>IFEval</td>
+<td>60.46</td>
+<td>63.80</td>
+<td>74.36</td>
+<td><b>78.17</b></td>
+</tr>
+<tr>
+<td>BBH (3-shot)</td>
+<td>36.95</td>
+<td>29.68</td>
+<td>42.14</td>
+<td><b>44.82</b></td>
+</tr>
+<tr>
+<td>MATH Lvl-5 (4-shot)</td>
+<td>12.76</td>
+<td>6.50</td>
+<td>0.23</td>
+<td><b>25.91</b></td>
+</tr>
+<tr>
+<td>GPQA (0-shot)</td>
+<td>11.30</td>
+<td>5.37</td>
+<td><b>14.77</b></td>
+<td>10.51</td>
+</tr>
+<tr>
+<td>MUSR (0-shot)</td>
+<td>12.84</td>
+<td>8.48</td>
+<td>9.74</td>
+<td><b>13.61</b></td>
+</tr>
+<tr>
+<td>MMLU-PRO (5-shot)</td>
+<td>33.06</td>
+<td>27.97</td>
+<td>31.95</td>
+<td><b>38.10</b></td>
+</tr>
+</tbody>
+</table>
+
+Also, we report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
 - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
-
-
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
 <colgroup>
 <col style="width: 10%;">
@@ -202,7 +264,7 @@ We report in the following table our internal pipeline benchmarks.
 <th>Category</th>
 <th>Benchmark</th>
 <th>Yi-1.5-9B-Chat</th>
-<th>Mistral-Nemo-
+<th>Mistral-Nemo-Instruct-2407 (12B)</th>
 <th>Falcon3-10B-Instruct</th>
 </tr>
 </thead>
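
The internal-pipeline bullets in the diff above (chat template, fewshot_as_multiturn, same batch size across models) describe an lm-evaluation-harness run. A minimal sketch of how such a run might be reproduced is shown below; the task list, batch size, dtype, and output path are illustrative assumptions, not the exact configuration used, and flag availability can vary with the harness version.

```python
# Sketch: reproducing the "raw scores" setup described in the README bullets
# with EleutherAI's lm-evaluation-harness CLI (pip install lm-eval).
# Tasks, batch size, dtype, and output path below are assumptions.
import subprocess

cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=tiiuae/Falcon3-10B-Instruct,dtype=bfloat16",
    "--tasks", "mmlu,gsm8k",            # assumed task list for illustration
    "--apply_chat_template",            # apply the model's chat template
    "--fewshot_as_multiturn",           # present few-shot examples as extra dialogue turns
    "--batch_size", "8",                # keep the same batch size across all compared models
    "--output_path", "results/falcon3-10b-instruct",
]
subprocess.run(cmd, check=True)
```

The same command, with only `--model_args` changed, would then be repeated for each baseline model so the scores stay comparable.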