sarath-shekkizhar
committed on
Update README.md
README.md
CHANGED
@@ -113,6 +113,13 @@ The task involves evaluation on `6` key benchmarks across reasoning and knowledg
 
 *The results reported are from local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` is submitted and will be reflected in the leaderboard once evaluation succeeds.
 
+**Note**: While the Open LLM Leaderboard shows other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in a multi-turn chat setting, such as the MT-Bench. We present the below comparison with a Llama3 finetune from the leaderboard.
+| Model | First Turn | Second Turn | Average |
+| --- | --- | --- | --- |
+| **tenyx/Llama3-TenyxChat-70B** | 8.12 | 8.18 | 8.15 |
+| *meta-llama/Llama3-TenyxChat-70B* | 8.05 | 7.87 | 7.96 |
+| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
+
 # Limitations
 
 Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.