svanlin-tencent committed on
Commit
d14b84f
1 Parent(s): 5c43911
Files changed (1)
  1. README.md +54 -26
README.md CHANGED
@@ -21,31 +21,59 @@ By open-sourcing the Hunyuan-Large model and revealing related technical details
  
 
 ## Benchmark Evaluation
- Hunyuan-Large achieves the best overall performance compared to both Dense and MoE based competitors having similar activated parameter sizes. For aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and classical NLP tasks such as QA and reading comprehension tasks (e.g., CommonsenseQA, PIQA, SIQA, BoolQ and TriviaQA). For the mathematics capability, Hunyuan-Large outperforms all baselines in math datasets of GSM8K and MATH, and also gains the best results on CMATH in Chinese.We also observe that Hunyuan-Large achieves the overall best performance in all Chinese tasks (e.g., CMMLU, C-Eval).
-
- | Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
- | ---------------- | ------------- | ------------ | ------------- | ------------ | ------------- |
- | MMLU | 85.2 | 79.5 | 77.6 | 78.5 | 88.4 |
- | MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
- | BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
- | HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
- | CommonsenseQA | 85.8 | 84.1 | 78.5 | - | 92.9 |
- | WinoGrande | 86.7 | 85.3 | 83.7 | 84.9 | 88.7 |
- | PIQA | - | - | 83.6 | 83.7 | 88.3 |
- | SIQA | - | - | 64.6 | - | 83.6 |
- | NaturalQuestions | - | - | 40.2 | 38.7 | 52.8 |
- | BoolQ | 80 | 79.4 | 87.4 | 84 | 92.9 |
- | DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
- | ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95 |
- | TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
- | CMMLU | - | - | 60 | 84 | 90.2 |
- | C-Eval | - | - | 59.6 | 81.7 | 91.9 |
- | C3 | - | - | 71.4 | 77.4 | 82.3 |
- | GSM8K | 89 | 83.7 | 83.7 | 79.2 | 92.8 |
- | MATH | 53.8 | 41.4 | 41.8 | 43.6 | 69.8 |
- | CMATH | - | - | 72.3 | 78.7 | 91.3 |
- | HumanEval | - | - | 53.1 | 48.8 | 71.4 |
- | MBPP | - | - | 78.6 | 73.9 | 87.3 |
+
+ **Hunyuan-Large pre-trained model** achieves the best overall performance compared to both Dense and MoE-based competitors with similar activated parameter sizes. For aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, SIQA, BoolQ and TriviaQA). For mathematics, Hunyuan-Large outperforms all baselines on GSM8K and MATH, and also gains the best result on the Chinese dataset CMATH. We also observe that Hunyuan-Large achieves the overall best performance on all Chinese tasks (e.g., CMMLU, C-Eval).
+
+ | Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
+ |------------------|---------------|--------------|---------------|-------------|---------------|
+ | MMLU | 85.2 | 79.3 | 77.8 | 78.5 | 88.4 |
+ | MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
+ | BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
+ | HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
+ | CommonsenseQA | 85.8 | 84.1 | 78.5 | - | 92.9 |
+ | WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | 88.7 |
+ | PIQA | - | - | 83.6 | 83.7 | 88.3 |
+ | SIQA | - | - | 64.6 | - | 83.6 |
+ | NaturalQuestions | - | - | 39.6 | 38.7 | 52.8 |
+ | BoolQ | 80.0 | 79.4 | 87.4 | 84.0 | 92.9 |
+ | DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
+ | ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95.0 |
+ | TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
+ | CMMLU | - | - | 60.0 | 84.0 | 90.2 |
+ | C-Eval | - | - | 59.6 | 81.7 | 91.9 |
+ | C3 | - | - | 71.4 | 77.4 | 82.3 |
+ | GSM8K | 89.0 | 83.7 | 83.7 | 79.2 | 92.8 |
+ | MATH | 53.8 | 41.4 | 42.5 | 43.6 | 69.8 |
+ | CMATH | - | - | 72.3 | 78.7 | 91.3 |
+ | HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | 71.4 |
+ | MBPP | 73.4 | 68.6 | 64.2 | 66.6 | 72.6 |
+
+ **Hunyuan-Large-Instruct** achieves consistent improvements on most types of tasks compared to LLMs with similar
+ activated parameters, indicating the effectiveness of our post-training. Examining model performance
+ across different categories of benchmarks, we find that our instruct model achieves the best performance on the MMLU and MATH datasets.
+ Notably, on the MMLU dataset, our model demonstrates a significant improvement, outperforming the LLama3.1-405B model by 2.6 points.
+ This improvement is more than marginal; it indicates Hunyuan-Large-Instruct's superior understanding and reasoning
+ capabilities across a wide array of language understanding tasks. The model's strength is further underscored by its performance
+ on the MATH dataset, where it surpasses the LLama3.1-405B by a notable margin of 3.6 points.
+ Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, highlighting the efficiency of our model.
+
+ | Model | LLama3.1 405B Inst. | LLama3.1 70B Inst. | Mixtral 8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
+ |----------------------|---------------------|--------------------|---------------------|-------------------|---------------------|
+ | MMLU | 87.3 | 83.6 | 77.8 | 80.4 | 89.9 |
+ | CMMLU | - | - | 61.0 | 79.5 | 90.4 |
+ | C-Eval | - | - | 60.0 | 79.9 | 88.6 |
+ | BBH | - | - | 82.0 | 87.1 | 81.2 |
+ | HellaSwag | - | - | 86.0 | 90.3 | 88.5 |
+ | ARC-C | 96.9 | 94.8 | 91.5 | 92.9 | 94.6 |
+ | DROP | - | - | 67.5 | 79.5 | 88.3 |
+ | GPQA_diamond | 50.7 | 46.7 | 38.4 | 42.4 | 42.4 |
+ | MATH | 73.8 | 68.0 | 51.0 | 74.7 | 77.4 |
+ | HumanEval | 89.0 | 80.5 | 75.6 | 89.0 | 90.0 |
+ | AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | 8.3 |
+ | MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | 9.4 |
+ | IFEval strict-prompt | 86.0 | 83.6 | 71.2 | - | 85.0 |
+
+
 
 
 

@@ -56,7 +84,7 @@ If you find our work helpful, feel free to give us a cite.
 ```
 @article{Tencent-Hunyuan-Large,
 title={Hunyuan-Large Technical Report},
- author={Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li Xuemeng Huang, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Fengzong Lian Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang Kan Wu, Dengpeng Wu, Guanghu1 Xu, Shaohua Chen, Fusheng Xiang, Shuang Chen, Xiao Feng Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Suncong Zheng, Xiong Kuang, Jianglu Hu Dian Jiao, Yiqi Chen, Jinbao Xue, Yangyu Tao, Chengzhong Xu, Winsony Hu, Feng Zhang, Jianshen Zhu Zhanhui Kang, Di Wang, Jie Jiang},
+ author={Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, and Jie Jiang},
 journal={arXiv:},
 year={2024}
 }
 
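The two margins quoted in the new instruct paragraph (2.6 points on MMLU, 3.6 on MATH) are plain differences of the scores in the instruct table. A minimal sketch that recomputes them; the values are copied from the table above, and the dictionary layout is an editor's illustration, not code from the repo:

```python
# Recompute the margins quoted in the commit's prose.
# Scores are copied from the Hunyuan-Large-Instruct table above;
# this layout is illustrative only, not code from the repository.
scores = {
    "MMLU": {"LLama3.1-405B Inst.": 87.3, "Hunyuan-Large Inst.": 89.9},
    "MATH": {"LLama3.1-405B Inst.": 73.8, "Hunyuan-Large Inst.": 77.4},
}

for bench, row in scores.items():
    margin = row["Hunyuan-Large Inst."] - row["LLama3.1-405B Inst."]
    print(f"{bench}: +{margin:.1f} points")

# Expected output:
# MMLU: +2.6 points
# MATH: +3.6 points
```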