Update README.md
For more detailed training information, please refer to Section 3.4 and Appendix

Here are the evaluation results for DCLM-Baseline-7B on various tasks:

| Task | Score |
|------|-------|
| MMLU (zero-shot) | 0.5766 |
| MMLU (few-shot) | 0.6372 |
| HellaSwag (zero-shot) | 0.7987 |
| HellaSwag | 0.8043 |
| Jeopardy | 0.4745 |
| TriviaQA | 0.5270 |
| GSM8K (CoT) | 0.0250 |
| AGI Eval SAT Math (CoT) | 0.0136 |
| AQuA (CoT) | 0.0490 |
| SVAMP (CoT) | 0.4900 |
| BigBench QA Wikidata | 0.7120 |
| ARC Easy | 0.8220 |
| ARC Challenge | 0.5990 |
| BigBench Misconceptions | 0.6986 |
| COPA | 0.8500 |
| SIQA | 0.8291 |
| CommonsenseQA | 0.8018 |
| PIQA | 0.8128 |
| OpenBookQA | 0.4540 |
| BigBench Novel Concepts | 0.7188 |
| BigBench Strange Stories | 0.7586 |
| BigBench Strategy QA | 0.6173 |
| LAMBADA | 0.8220 |
| Winograd | 0.8828 |
| Winogrande | 0.7269 |
| BigBench Conlang Translation | 0.0244 |
| BigBench Language Identification | 0.5219 |
| BigBench Conceptual Combinations | 0.6990 |
| BigBench Elementary Math QA | 0.3431 |
| BigBench Dyck Languages | 0.4930 |
| AGI Eval LSAT AR | 0.2435 |
| BigBench CS Algorithms | 0.6121 |
| BigBench Logical Deduction | 0.3620 |
| BigBench Operators | 0.4857 |
| BigBench Repeat Copy Logic | 0.4063 |
| Simple Arithmetic (no spaces) | 0.2940 |
| Simple Arithmetic (with spaces) | 0.3110 |
| MathQA | 0.3098 |
| LogiQA | 0.4132 |
| PubMedQA | 0.7060 |
| SQuAD | 0.5856 |
| AGI Eval LSAT RC | 0.6716 |
| AGI Eval LSAT LR | 0.5392 |
| CoQA | 0.4074 |
| BigBench Understanding Fables | 0.6825 |
| BoolQ | 0.8343 |
| AGI Eval SAT EN | 0.7670 |
| Winogender MC (Female) | 0.6000 |
| Winogender MC (Male) | 0.5500 |
| Enterprise PII Classification | 0.7676 |
| BBQ | 0.6912 |
| GPQA Main | 0.2612 |
| GPQA Diamond | 0.2475 |

Note: All scores are presented as decimal values between 0 and 1, representing the proportion of correct answers or the model's performance on each task.
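As a minimal illustration of what "proportion of correct answers" means, the sketch below computes such a score from exact-match comparisons. The predictions and references here are made up for demonstration and are not drawn from the actual evaluation harness or any task above.

```python
# Illustrative only: toy predictions/references, not real DCLM eval data.

def accuracy(predictions, references):
    """Return the proportion of predictions that exactly match the reference."""
    assert len(predictions) == len(references), "length mismatch"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["B", "A", "D", "C"]  # hypothetical model answers
golds = ["B", "A", "C", "C"]  # hypothetical gold answers
score = accuracy(preds, golds)
print(f"{score:.4f}")  # 3 of 4 correct -> 0.7500
```

Scores in the table are reported to four decimal places in this same 0 to 1 form.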
## Limitations and Biases