**Hunyuan-Large pre-trained model** achieves the best overall performance compared to both Dense and MoE-based competitors with similar activated parameter sizes. On aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best results, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, SIQA, BoolQ, and TriviaQA). For mathematics, Hunyuan-Large outperforms all baselines on the GSM8K and MATH datasets, and also achieves the best result on CMATH in Chinese. We also observe that Hunyuan-Large achieves the best overall performance across all Chinese tasks (e.g., CMMLU, C-Eval).

| Model            | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
|------------------|---------------|--------------|---------------|-------------|---------------|
| MMLU             | 85.2          | 79.3         | 77.8          | 78.5        | **88.4**      |
| MMLU-Pro         | **61.6**      | 53.8         | 49.5          | -           | 60.2          |
| BBH              | 85.9          | 81.6         | 78.9          | 78.9        | **86.3**      |
| HellaSwag        | -             | -            | **88.7**      | 87.8        | 86.8          |
| CommonsenseQA    | 85.8          | 84.1         | 78.5          | -           | **92.9**      |
| WinoGrande       | 86.7          | 85.3         | 85.0          | 84.9        | **88.7**      |
| PIQA             | -             | -            | 83.6          | 83.7        | **88.3**      |
| SIQA             | -             | -            | 64.6          | -           | **83.6**      |
| NaturalQuestions | -             | -            | 39.6          | 38.7        | **52.8**      |
| BoolQ            | 80.0          | 79.4         | 87.4          | 84.0        | **92.9**      |
| DROP             | 84.8          | 79.6         | 80.4          | 80.1        | **88.9**      |
| ARC-C            | **96.1**      | 92.9         | 91.2          | 92.4        | 95.0          |
| TriviaQA         | -             | -            | 82.1          | 79.9        | **89.2**      |
| CMMLU            | -             | -            | 60.0          | 84.0        | **90.2**      |
| C-Eval           | -             | -            | 59.6          | 81.7        | **91.9**      |
| C3               | -             | -            | 71.4          | 77.4        | **82.3**      |
| GSM8K            | 89.0          | 83.7         | 83.7          | 79.2        | **92.8**      |
| MATH             | 53.8          | 41.4         | 42.5          | 43.6        | **69.8**      |
| CMATH            | -             | -            | 72.3          | 78.7        | **91.3**      |
| HumanEval        | 61.0          | 58.5         | 53.1          | 48.8        | **71.4**      |
| MBPP             | **73.4**      | 68.6         | 64.2          | 66.6        | 72.6          |

**Hunyuan-Large-Instruct** achieves consistent improvements on most types of tasks compared to LLMs with similar activated parameters, indicating the effectiveness of our post-training. Delving into the model performance across different categories of benchmarks, the improvement is especially notable on the MATH dataset, where it surpasses LLama3.1-405B by a margin of 3.6%. Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, underscoring the efficiency of our model.

| Model                | LLama3.1 405B Inst. | LLama3.1 70B Inst. | Mixtral 8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
|----------------------|---------------------|--------------------|---------------------|-------------------|---------------------|
| MMLU                 | 87.3                | 83.6               | 77.8                | 80.4              | **89.9**            |
| CMMLU                | -                   | -                  | 61.0                | 79.5              | **90.4**            |
| C-Eval               | -                   | -                  | 60.0                | 79.9              | **88.6**            |
| BBH                  | -                   | -                  | 82.0                | **87.1**          | 81.2                |
| HellaSwag            | -                   | -                  | 86.0                | **90.3**          | 88.5                |
| ARC-C                | **96.9**            | 94.8               | 91.5                | 92.9              | 94.6                |
| DROP                 | -                   | -                  | 67.5                | 79.5              | **88.3**            |
| GPQA_diamond         | **50.7**            | 46.7               | 38.4                | 42.4              | 42.4                |
| MATH                 | 73.8                | 68.0               | 51.0                | 74.7              | **77.4**            |
| HumanEval            | 89.0                | 80.5               | 75.6                | 89.0              | **90.0**            |
| AlignBench           | 6.0                 | 5.9                | 6.2                 | 8.0               | **8.3**             |
| MT-Bench             | 9.1                 | 8.8                | 8.1                 | 9.0               | **9.4**             |
| IFEval strict-prompt | **86.0**            | 83.6               | 71.2                | -                 | 85.0                |
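As a quick sanity check, the 3.6% MATH margin claimed above can be reproduced directly from the instruct table. The sketch below is illustrative only: the `math_scores` dict and its keys are transcribed from the table in this README, not part of any released API.

```python
# Illustrative check of the reported MATH margin, using scores
# transcribed from the instruct-model table above (assumption:
# the table values are the authoritative numbers).
math_scores = {
    "LLama3.1 405B Inst.": 73.8,
    "LLama3.1 70B Inst.": 68.0,
    "Mixtral 8x22B Inst.": 51.0,
    "DeepSeekV2.5 Chat": 74.7,
    "Hunyuan-Large Inst.": 77.4,
}

# Margin of Hunyuan-Large-Instruct over LLama3.1-405B-Instruct on MATH,
# rounded to one decimal to avoid floating-point noise.
margin = round(math_scores["Hunyuan-Large Inst."] - math_scores["LLama3.1 405B Inst."], 1)
print(margin)  # 3.6
```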