chrisociepa committed on
Commit
ce2b447
1 Parent(s): 86c3e30

Update README.md

  1. README.md +6 -5
README.md CHANGED
@@ -119,7 +119,7 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
  - Reader (Generator) - open book question answering task, commonly used in RAG
  - Perplexity (lower is better) - as a bonus, does not correlate with other scores and should not be used for model comparison
 
-
+ As of April 3, 2024, the following table showcases the current scores of pretrained and continuously pretrained models according to the Open PL LLM Leaderboard, evaluated in a 5-shot setting:
 
  | | Average | RAG Reranking | RAG Reader | Perplexity |
  |--------------------------------------------------------------------------------------|----------:|--------------:|-----------:|-----------:|
@@ -137,7 +137,7 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
  | mistralai/Mistral-7B-Instruct-v0.2 | 40.29 | 72.58 | 79.39 | 2088.08 |
  | teknium/OpenHermes-2.5-Mistral-7B | 42.64 | 70.63 | 80.25 | 1463.00 |
  | openchat/openchat-3.5-1210 | 44.17 | 71.76 | 82.15 | 1923.83 |
- | speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6 | 45.44 | 71.27 | 91.50 | 279.24 |
+ | speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6 (experimental) | 45.44 | 71.27 | 91.50 | 279.24 |
  | Nexusflow/Starling-LM-7B-beta | 45.69 | 74.58 | 81.22 | 1161.54 |
  | openchat/openchat-3.5-0106 | 47.32 | 74.71 | 83.60 | 1106.56 |
  | berkeley-nest/Starling-LM-7B-alpha | **47.46** | **75.73** | 82.86 | 1438.04 |
@@ -155,13 +155,14 @@ Models have been evaluated on [Open PL LLM Leaderboard](https://huggingface.co/s
  | mistralai/Mistral-7B-v0.1 | 30.67 | 60.35 | 85.39 | 857.32 |
  | internlm/internlm2-7b | 33.03 | 69.39 | 73.63 | 5498.23 |
  | alpindale/Mistral-7B-v0.2-hf | 33.05 | 60.23 | 85.21 | 932.60 |
- | speakleash/mistral-apt3-7B/spi-e0_hf | 35.50 | 62.14 | **87.48** | 132.78 |
+ | speakleash/mistral-apt3-7B/spi-e0_hf (experimental) | 35.50 | 62.14 | **87.48** | 132.78 |
 
  SpeakLeash models have one of the best scores in the RAG Reader task.
  We have managed to increase Average score by almost 9 pp. in comparison to Mistral-7B-v0.1.
  In our subjective evaluations of chatting skills SpeakLeash models perform better than other models with higher Average scores.
 
-
+ The results in the above table were obtained without utilizing instruction templates for instructional models, instead treating them like base models.
+ This approach could skew the results, as instructional models are optimized with specific instructions in mind.
 
  ## Limitations and Biases
 
@@ -212,7 +213,7 @@ The model could not have been created without the commitment and work of the ent
  [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/),
  and many other wonderful researchers and enthusiasts of the AI world.
 
- Members of the ACK Cyfronet AGH team:
+ Members of the ACK Cyfronet AGH team providing valuable support and expertise:
  [Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/).
 
  ## Contact Us