tenyx/Llama3-TenyxChat-70B

sarath-shekkizhar committed on May 8, 2024

Commit a85d31e · verified · 1 Parent(s): de770dc

Update README.md

Files changed (1)

README.md +7 -0

@@ -113,6 +113,13 @@ The task involves evaluation on `6` key benchmarks across reasoning and knowledg
 *The results reported are from local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` is submitted and will be reflected in the leaderboard once evaluation succeeds.
+**Note**: While the Open LLM Leaderboard lists other performant Llama-3 fine-tuned models, we observe that these models typically regress in multi-turn chat settings such as MT-Bench. The table below compares our model with a Llama-3 fine-tune from the leaderboard.
+| Model | First Turn | Second Turn | Average |
+| --- | --- | --- | --- |
+| **tenyx/Llama3-TenyxChat-70B** | 8.12 | 8.18 | 8.15 |
+| *meta-llama/Meta-Llama-3-70B-Instruct* | 8.05 | 7.87 | 7.96 |
+| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
 # Limitations
 Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
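
The multi-turn regression described in the added note can be read directly from the MT-Bench table in this diff by subtracting each model's first-turn score from its second-turn score. The following is a minimal sketch of that arithmetic, not part of the commit; the row labels are shorthand chosen here for illustration, and the scores are copied from the table as shown.

```python
# Per-turn MT-Bench scores (first turn, second turn), copied from the table
# added in this commit. The keys are illustrative shorthand, not exact repo ids.
scores = {
    "TenyxChat-70B": (8.12, 8.18),
    "meta-llama baseline (second table row)": (8.05, 7.87),
    "MaziyarPanahi DPO-v0.4": (8.05, 7.82),
}


def turn_delta(first: float, second: float) -> float:
    """Change from first to second turn, rounded to the table's two decimals."""
    return round(second - first, 2)


deltas = {model: turn_delta(*turns) for model, turns in scores.items()}

for model, delta in deltas.items():
    # A negative delta means the model scored lower on the second turn.
    print(f"{model}: {delta:+.2f}")
```

Under these numbers, only the first row shows a non-negative change between turns; the other two rows drop on the second turn, which is the pattern the note refers to.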