Update README.md
For more detailed training information, please refer to Section 3.4 and Appendix

Here are the evaluation results for DCLM-Baseline-7B on various tasks:

| Task | Score |
|------|-------|
| MMLU (zero-shot) | 0.5766 |
| MMLU (few-shot) | 0.6372 |
| HellaSwag (zero-shot) | 0.7987 |
| HellaSwag | 0.8043 |
| Jeopardy | 0.4745 |
| TriviaQA | 0.5270 |
| GSM8K (CoT) | 0.0250 |
| AGI Eval SAT Math (CoT) | 0.0136 |
| AQuA (CoT) | 0.0490 |
| SVAMP (CoT) | 0.4900 |
| BigBench QA Wikidata | 0.7120 |
| ARC Easy | 0.8220 |
| ARC Challenge | 0.5990 |
| BigBench Misconceptions | 0.6986 |
| COPA | 0.8500 |
| SIQA | 0.8291 |
| CommonsenseQA | 0.8018 |
| PIQA | 0.8128 |
| OpenBookQA | 0.4540 |
| BigBench Novel Concepts | 0.7188 |
| BigBench Strange Stories | 0.7586 |
| BigBench Strategy QA | 0.6173 |
| LAMBADA | 0.8220 |
| Winograd | 0.8828 |
| Winogrande | 0.7269 |
| BigBench Conlang Translation | 0.0244 |
| BigBench Language Identification | 0.5219 |
| BigBench Conceptual Combinations | 0.6990 |
| BigBench Elementary Math QA | 0.3431 |
| BigBench Dyck Languages | 0.4930 |
| AGI Eval LSAT AR | 0.2435 |
| BigBench CS Algorithms | 0.6121 |
| BigBench Logical Deduction | 0.3620 |
| BigBench Operators | 0.4857 |
| BigBench Repeat Copy Logic | 0.4063 |
| Simple Arithmetic (no spaces) | 0.2940 |
| Simple Arithmetic (with spaces) | 0.3110 |
| MathQA | 0.3098 |
| LogiQA | 0.4132 |
| PubMedQA | 0.7060 |
| SQuAD | 0.5856 |
| AGI Eval LSAT RC | 0.6716 |
| AGI Eval LSAT LR | 0.5392 |
| CoQA | 0.4074 |
| BigBench Understanding Fables | 0.6825 |
| BoolQ | 0.8343 |
| AGI Eval SAT EN | 0.7670 |
| Winogender MC (Female) | 0.6000 |
| Winogender MC (Male) | 0.5500 |
| Enterprise PII Classification | 0.7676 |
| BBQ | 0.6912 |
| GPQA Main | 0.2612 |
| GPQA Diamond | 0.2475 |

Note: All scores are presented as decimal values between 0 and 1, representing the proportion of correct answers or the model's performance on each task.
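As a minimal illustration of what "proportion of correct answers" means, the sketch below computes such a score from exact-match comparisons. The predictions and references here are made up for demonstration and are not drawn from the actual evaluation harness or any task above.

```python
# Illustrative only: toy predictions/references, not real DCLM eval data.

def accuracy(predictions, references):
    """Return the proportion of predictions that exactly match the reference."""
    assert len(predictions) == len(references), "length mismatch"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["B", "A", "D", "C"]  # hypothetical model answers
golds = ["B", "A", "C", "C"]  # hypothetical gold answers
score = accuracy(preds, golds)
print(f"{score:.4f}")  # 3 of 4 correct -> 0.7500
```

Scores in the table are reported to four decimal places in this same 0 to 1 form.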
## Limitations and Biases