vaishaal committed
Commit
e9e0a6b
1 Parent(s): fc78421

Update README.md

Files changed (1)
  1. README.md +58 -17
README.md CHANGED
@@ -69,23 +69,64 @@ For more detailed training information, please refer to Section 3.4 and Appendix
 
 Here are the evaluation results for DCLM-Baseline-7B on various tasks:
 
-| Task | Score |
-|--------------------------|---------|
-| CORE | 57.1 |
-| MMLU (5-shot) | 63.7 |
-| EXTENDED | 45.4 |
-| ARC Challenge | 57.68 |
-| ARC Easy | 81.82 |
-| BoolQ | 83.36 |
-| COPA | 87.00 |
-| HellaSwag | 80.68 |
-| OpenBookQA | 46.40 |
-| PIQA | 80.85 |
-| Winogrande | 73.80 |
-| AGI Eval LSAT AR (3-shot)| 29.57 |
-| GSM8K (CoT) | 17.13 |
-
-For a complete list of evaluation results, please refer to the full evaluation JSON file.
+| Task | Score |
+|------|-------|
+| MMLU (zero-shot) | 0.5766 |
+| MMLU (few-shot) | 0.6372 |
+| HellaSwag (zero-shot) | 0.7987 |
+| HellaSwag | 0.8043 |
+| Jeopardy | 0.4745 |
+| TriviaQA | 0.5270 |
+| GSM8K (CoT) | 0.0250 |
+| AGI Eval SAT Math (CoT) | 0.0136 |
+| AQuA (CoT) | 0.0490 |
+| SVAMP (CoT) | 0.4900 |
+| BigBench QA Wikidata | 0.7120 |
+| ARC Easy | 0.8220 |
+| ARC Challenge | 0.5990 |
+| BigBench Misconceptions | 0.6986 |
+| COPA | 0.8500 |
+| SIQA | 0.8291 |
+| CommonsenseQA | 0.8018 |
+| PIQA | 0.8128 |
+| OpenBookQA | 0.4540 |
+| BigBench Novel Concepts | 0.7188 |
+| BigBench Strange Stories | 0.7586 |
+| BigBench Strategy QA | 0.6173 |
+| LAMBADA | 0.8220 |
+| Winograd | 0.8828 |
+| Winogrande | 0.7269 |
+| BigBench Conlang Translation | 0.0244 |
+| BigBench Language Identification | 0.5219 |
+| BigBench Conceptual Combinations | 0.6990 |
+| BigBench Elementary Math QA | 0.3431 |
+| BigBench Dyck Languages | 0.4930 |
+| AGI Eval LSAT AR | 0.2435 |
+| BigBench CS Algorithms | 0.6121 |
+| BigBench Logical Deduction | 0.3620 |
+| BigBench Operators | 0.4857 |
+| BigBench Repeat Copy Logic | 0.4063 |
+| Simple Arithmetic (no spaces) | 0.2940 |
+| Simple Arithmetic (with spaces) | 0.3110 |
+| MathQA | 0.3098 |
+| LogiQA | 0.4132 |
+| PubMedQA | 0.7060 |
+| SQuAD | 0.5856 |
+| AGI Eval LSAT RC | 0.6716 |
+| AGI Eval LSAT LR | 0.5392 |
+| CoQA | 0.4074 |
+| BigBench Understanding Fables | 0.6825 |
+| BoolQ | 0.8343 |
+| AGI Eval SAT EN | 0.7670 |
+| Winogender MC (Female) | 0.6000 |
+| Winogender MC (Male) | 0.5500 |
+| Enterprise PII Classification | 0.7676 |
+| BBQ | 0.6912 |
+| GPQA Main | 0.2612 |
+| GPQA Diamond | 0.2475 |
+
+Note: All scores are presented as decimal values between 0 and 1, representing the proportion of correct answers or the model's performance on each task.
+
 
 ## Limitations and Biases
 
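A note on scales: the removed table reported scores on a 0-100 scale, while the updated table reports proportions in [0, 1]. A minimal sketch of putting the new numbers on the old scale for side-by-side reading (the task names and values below are copied from the diff; nothing else is assumed):

```python
# Proportions taken from the updated table in the diff above.
new_scores = {
    "MMLU (few-shot)": 0.6372,
    "ARC Easy": 0.8220,
    "BoolQ": 0.8343,
    "GSM8K (CoT)": 0.0250,
}

# Convert each proportion to a percentage on the 0-100 scale the
# removed table used, rounded to two decimals for display.
as_percent = {task: round(p * 100, 2) for task, p in new_scores.items()}

for task, pct in as_percent.items():
    print(f"{task}: {pct}%")
```

Note that the converted values need not match the removed table exactly; the two tables come from different evaluation runs.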