svanlin-tencent committed on
Commit e183385
1 Parent(s): f5ca05e
Files changed (1)
  1. README.md +28 -1
README.md CHANGED
@@ -20,7 +20,34 @@ By open-sourcing the Hunyuan-Large model and revealing related technical details
- ### Benchmark
+ ## Benchmark Evaluation
+ Hunyuan-Large achieves the best overall performance compared with both dense and MoE-based competitors that have similar activated parameter sizes. On aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best results, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, as well as in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, SIQA, BoolQ, and TriviaQA). In mathematics, Hunyuan-Large outperforms all baselines on the GSM8K and MATH datasets, and also achieves the best result on the Chinese math dataset CMATH. We also observe that Hunyuan-Large achieves the best overall performance across all Chinese tasks (e.g., CMMLU, C-Eval).
+
+ | Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
+ | ---------------- | ------------- | ------------ | ------------- | ------------ | ------------- |
+ | MMLU | 85.2 | 79.5 | 77.6 | 78.5 | 88.4 |
+ | MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
+ | BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
+ | HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
+ | CommonsenseQA | 85.8 | 84.1 | 78.5 | - | 92.9 |
+ | WinoGrande | 86.7 | 85.3 | 83.7 | 84.9 | 88.7 |
+ | PIQA | - | - | 83.6 | 83.7 | 88.3 |
+ | SIQA | - | - | 64.6 | - | 83.6 |
+ | NaturalQuestions | - | - | 40.2 | 38.7 | 52.8 |
+ | BoolQ | 80 | 79.4 | 87.4 | 84 | 92.9 |
+ | DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
+ | ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95 |
+ | TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
+ | CMMLU | - | - | 60 | 84 | 90.2 |
+ | C-Eval | - | - | 59.6 | 81.7 | 91.9 |
+ | C3 | - | - | 71.4 | 77.4 | 82.3 |
+ | GSM8K | 89 | 83.7 | 83.7 | 79.2 | 92.8 |
+ | MATH | 53.8 | 41.4 | 41.8 | 43.6 | 69.8 |
+ | CMATH | - | - | 72.3 | 78.7 | 91.3 |
+ | HumanEval | - | - | 53.1 | 48.8 | 71.4 |
+ | MBPP | - | - | 78.6 | 73.9 | 87.3 |
+
+
  ### Citation