chrisociepa committed
Commit: ce2b447 • Parent(s): 86c3e30
Update README.md
README.md CHANGED
@@ -119,7 +119,7 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
 - Reader (Generator) - open book question answering task, commonly used in RAG
 - Perplexity (lower is better) - as a bonus, does not correlate with other scores and should not be used for model comparison
 
-
+As of April 3, 2024, the following table showcases the current scores of pretrained and continuously pretrained models according to the Open PL LLM Leaderboard, evaluated in a 5-shot setting:
 
 |                                                                          | Average | RAG Reranking | RAG Reader | Perplexity |
 |--------------------------------------------------------------------------------------|----------:|--------------:|-----------:|-----------:|
@@ -137,7 +137,7 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
 | mistralai/Mistral-7B-Instruct-v0.2 | 40.29 | 72.58 | 79.39 | 2088.08 |
 | teknium/OpenHermes-2.5-Mistral-7B | 42.64 | 70.63 | 80.25 | 1463.00 |
 | openchat/openchat-3.5-1210 | 44.17 | 71.76 | 82.15 | 1923.83 |
-| speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6
+| speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6 (experimental) | 45.44 | 71.27 | 91.50 | 279.24 |
 | Nexusflow/Starling-LM-7B-beta | 45.69 | 74.58 | 81.22 | 1161.54 |
 | openchat/openchat-3.5-0106 | 47.32 | 74.71 | 83.60 | 1106.56 |
 | berkeley-nest/Starling-LM-7B-alpha | **47.46** | **75.73** | 82.86 | 1438.04 |
@@ -155,13 +155,14 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
 | mistralai/Mistral-7B-v0.1 | 30.67 | 60.35 | 85.39 | 857.32 |
 | internlm/internlm2-7b | 33.03 | 69.39 | 73.63 | 5498.23 |
 | alpindale/Mistral-7B-v0.2-hf | 33.05 | 60.23 | 85.21 | 932.60 |
-| speakleash/mistral-apt3-7B/spi-e0_hf
+| speakleash/mistral-apt3-7B/spi-e0_hf (experimental) | 35.50 | 62.14 | **87.48** | 132.78 |
 
 SpeakLeash models have one of the best scores in the RAG Reader task.
 We have managed to increase the Average score by almost 9 pp. in comparison to Mistral-7B-v0.1.
 In our subjective evaluations of chatting skills, SpeakLeash models perform better than other models with higher Average scores.
 
-
+The results in the above table were obtained without utilizing instruction templates for instructional models, instead treating them like base models.
+This approach could skew the results, as instructional models are optimized with specific instructions in mind.
 
 ## Limitations and Biases
 
@@ -212,7 +213,7 @@ The model could not have been created without the commitment and work of the ent
 [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/),
 and many other wonderful researchers and enthusiasts of the AI world.
 
-Members of the ACK Cyfronet AGH team:
+Members of the ACK Cyfronet AGH team providing valuable support and expertise:
 [Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/).
 
 ## Contact Us
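A note on the Perplexity column in the tables above: perplexity measures how well a model predicts the evaluation text (lower is better), and because the value depends on each model's own tokenizer and vocabulary it is not directly comparable across models, which is why the README advises against using it for model comparison. Below is a minimal sketch of how such a score can be computed with Hugging Face transformers; it is only an illustration under those assumptions, not the Open PL LLM Leaderboard's actual evaluation harness, and the model id and sample sentence are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id, used here purely for illustration.
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def perplexity(text: str) -> float:
    """Return exp(mean next-token cross-entropy) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # next-token cross-entropy loss; its exponential is the perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Example with a placeholder Polish sentence:
print(perplexity("Wrocław to miasto położone w południowo-zachodniej Polsce."))
```

Because two models with different tokenizers split the same sentence into different numbers of tokens, the same text can yield very different perplexities even for similarly capable models, which is one reason the metric does not track the other columns.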
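The added note about instruction templates means that instruction-tuned models were prompted as if they were base models: the task text (with the solved examples simply prepended in a 5-shot run) was given as raw input to complete, rather than being wrapped in the chat format those models were fine-tuned on. The sketch below contrasts the two prompting styles using the Hugging Face transformers API; the model id and the example question are placeholders, not the leaderboard's actual prompts.

```python
from transformers import AutoTokenizer

# Placeholder instruction-tuned model; any model shipping a chat template behaves similarly.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

question = "W którym roku odbył się chrzest Polski?"

# Base-model style (how the runs described above treated every model):
# the question is passed as plain text for the model to continue.
raw_prompt = f"Pytanie: {question}\nOdpowiedź:"

# Instruction-model style: the same question wrapped in the model's own chat
# template, which is the format instruction-tuned models are optimized for.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(raw_prompt)
print(chat_prompt)  # for Mistral-style templates this adds [INST] ... [/INST] markers
```

Evaluating instruction-tuned models with the first style can understate their scores, which is the skew the added note warns about.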