sarath-shekkizhar committed · Commit a85d31e · verified · 1 Parent(s): de770dc

Update README.md

Files changed (1)
  1. README.md +7 -0
README.md CHANGED
@@ -113,6 +113,13 @@ The task involves evaluation on `6` key benchmarks across reasoning and knowledg

  *The results reported are from local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` is submitted and will be reflected in the leaderboard once evaluation succeeds.

+ **Note**: While the Open LLM Leaderboard shows other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in a multi-turn chat setting, such as the MT-Bench. We present the below comparison with a Llama3 finetune from the leaderboard.
+ | Model | First Turn | Second Turn | Average |
+ | --- | --- | --- | --- |
+ | **tenyx/Llama3-TenyxChat-70B** | 8.12 | 8.18 | 8.15 |
+ | *meta-llama/Llama3-TenyxChat-70B* | 8.05 | 7.87 | 7.96 |
+ | MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
+
  # Limitations

  Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
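The MT-Bench rows added in this commit can be sanity-checked with a few lines of Python. This is a minimal sketch, not the authors' evaluation code: it assumes the Average column is the unweighted mean of the two turn scores, and it uses only the rounded values shown in the table, so agreement is expected only to within about 0.01. The names `rows` and `turn_stats` are hypothetical and exist only for this illustration.

```python
# Scores copied from the MT-Bench table in this diff:
# model name -> (first turn, second turn, reported average).
rows = {
    "tenyx/Llama3-TenyxChat-70B": (8.12, 8.18, 8.15),
    "meta-llama/Llama3-TenyxChat-70B": (8.05, 7.87, 7.96),
    "MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4": (8.05, 7.82, 7.93),
}


def turn_stats(first, second):
    """Return (mean of the two turns, second-turn change relative to the first).

    Assumption: the reported average is the plain mean of both turns.
    """
    return (first + second) / 2, second - first


stats = {name: turn_stats(first, second) for name, (first, second, _) in rows.items()}

for name, (first, second, reported) in rows.items():
    mean, delta = stats[name]
    print(f"{name}: mean={mean:.3f} (reported {reported}), second-first={delta:+.2f}")
```

A positive second-minus-first value means the score did not drop in the second turn; a negative value corresponds to the multi-turn regression the note describes.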