Achal Dave committed
Commit: 77ec709
Parent(s): 39adea4
README updates

README.md CHANGED
@@ -7,9 +7,9 @@ license: apache-2.0
 <img src="https://cdn-uploads.huggingface.co/production/uploads/63118add64939fabc0108b28/BB42g4V8HTxb5dR4tcy8A.png" alt="DCLM Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>


-# Model Card for DCLM-
+# Model Card for DCLM-1B

-DCLM-
+DCLM-1B is a 1.4 billion parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. This model is designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.

 ## Model Details

@@ -48,12 +48,13 @@ The model was trained using the following setup:
 - **Total Training Tokens:** 2.6T
 - **Hardware:** Trained on H100 GPUs

-For more detailed training information, please refer to
+For more detailed training information, please refer to Appendix P.3 of the
+paper.
 To ensure our trained model is broadly useful, including for math and coding tasks, we combine our 3.8T [DCLM-BASELINE](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) with the [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) and [ProofPile2](https://huggingface.co/datasets/EleutherAI/proof-pile-2) data to arrive at a 4.1T token dataset.

 ## Evaluation

-Here are the evaluation results for DCLM-
+Here are the evaluation results for DCLM-1B on various tasks (using [llm-foundry](https://github.com/mosaicml/llm-foundry) eval suite)

 | Task | Score |
 |------------------------------------------|---------|
@@ -116,7 +117,7 @@ Note: All scores are presented as decimal values between 0 and 1, representing t

 ## Limitations and Biases

-While DCLM-
+While DCLM-1B demonstrates strong performance across a range of tasks, it's important to note:

 1. The model may exhibit biases present in its training data, which is derived from web crawl data.
 2. It has not undergone specific alignment or safety fine-tuning, so outputs should be used with caution.