loubnabnl committed (verified) · Commit 2597a60 · 1 Parent(s): 8b2d8a0

Update README.md

Files changed (1):
  1. README.md +75 -137

README.md CHANGED
@@ -7,16 +7,58 @@ metrics:
  - recall
  - accuracy
  model-index:
- - name: classifier-llama3-typescript-500k
  results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # classifier-llama3-typescript-500k
 
- This model is a fine-tuned version of [bigcode/starencoder](https://huggingface.co/bigcode/starencoder) on an unknown dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.3169
  - Precision: 0.7165
@@ -26,147 +68,43 @@ It achieves the following results on the evaluation set:
  - F1 Binary Minimum3: 0.5559
  - F1 Binary Minimum2: 0.9293
 
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
 
- More information needed
-
- ## Training procedure
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 16
  - eval_batch_size: 256
  - seed: 0
  - distributed_type: multi-GPU
- - num_devices: 8
  - total_train_batch_size: 128
- - total_eval_batch_size: 2048
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 200
- - num_epochs: 30
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 Macro | Accuracy | F1 Binary Minimum3 | F1 Binary Minimum2 |
- |:-------------:|:-------:|:------:|:---------------:|:---------:|:------:|:--------:|:--------:|:------------------:|:------------------:|
- | No log | 0 | 0 | 4.5438 | 0.0358 | 0.2 | 0.0607 | 0.1788 | 0 | 0 |
- | 0.3468 | 0.2960 | 1000 | 0.3515 | 0.4698 | 0.3119 | 0.3243 | 0.6325 | 0.4623 | 0.9243 |
- | 0.3432 | 0.5921 | 2000 | 0.3465 | 0.5149 | 0.3365 | 0.3559 | 0.6356 | 0.5743 | 0.9252 |
- | 0.345 | 0.8881 | 3000 | 0.3374 | 0.5098 | 0.3361 | 0.3564 | 0.6431 | 0.5591 | 0.9264 |
- | 0.3487 | 1.1841 | 4000 | 0.3350 | 0.5081 | 0.3339 | 0.3557 | 0.6438 | 0.5224 | 0.9265 |
- | 0.3461 | 1.4802 | 5000 | 0.3331 | 0.5103 | 0.3427 | 0.3673 | 0.6455 | 0.5533 | 0.9269 |
- | 0.3193 | 1.7762 | 6000 | 0.3339 | 0.5122 | 0.3453 | 0.3687 | 0.6449 | 0.5696 | 0.9273 |
- | 0.3301 | 2.0722 | 7000 | 0.3312 | 0.5107 | 0.3492 | 0.3756 | 0.6472 | 0.5585 | 0.9270 |
- | 0.3246 | 2.3683 | 8000 | 0.3411 | 0.5137 | 0.3533 | 0.3783 | 0.6396 | 0.5934 | 0.9260 |
- | 0.3301 | 2.6643 | 9000 | 0.3362 | 0.5139 | 0.3530 | 0.3791 | 0.6438 | 0.5876 | 0.9264 |
- | 0.3342 | 2.9603 | 10000 | 0.3306 | 0.5019 | 0.3407 | 0.3642 | 0.6462 | 0.5157 | 0.9268 |
- | 0.3321 | 3.2564 | 11000 | 0.3287 | 0.5076 | 0.3521 | 0.3796 | 0.6481 | 0.5594 | 0.9275 |
- | 0.3434 | 3.5524 | 12000 | 0.3368 | 0.4982 | 0.3309 | 0.3501 | 0.6418 | 0.4749 | 0.9249 |
- | 0.3305 | 3.8484 | 13000 | 0.3297 | 0.5043 | 0.3391 | 0.3635 | 0.6467 | 0.5192 | 0.9266 |
- | 0.3187 | 4.1445 | 14000 | 0.3274 | 0.5044 | 0.3480 | 0.3751 | 0.6483 | 0.5470 | 0.9266 |
- | 0.3252 | 4.4405 | 15000 | 0.3323 | 0.5137 | 0.3585 | 0.3864 | 0.6449 | 0.5870 | 0.9273 |
- | 0.3316 | 4.7365 | 16000 | 0.3275 | 0.5032 | 0.3458 | 0.3716 | 0.6485 | 0.5302 | 0.9270 |
- | 0.3362 | 5.0326 | 17000 | 0.3305 | 0.4999 | 0.3403 | 0.3641 | 0.6452 | 0.5011 | 0.9265 |
- | 0.3256 | 5.3286 | 18000 | 0.3257 | 0.5044 | 0.3489 | 0.3755 | 0.6496 | 0.5446 | 0.9277 |
- | 0.3392 | 5.6246 | 19000 | 0.3291 | 0.4991 | 0.3463 | 0.3717 | 0.6474 | 0.5152 | 0.9266 |
- | 0.3264 | 5.9207 | 20000 | 0.3259 | 0.5120 | 0.3466 | 0.3738 | 0.6493 | 0.5481 | 0.9278 |
- | 0.3303 | 6.2167 | 21000 | 0.3251 | 0.5138 | 0.3512 | 0.3802 | 0.6496 | 0.5513 | 0.9280 |
- | 0.3296 | 6.5127 | 22000 | 0.3286 | 0.4984 | 0.3449 | 0.3698 | 0.6471 | 0.5119 | 0.9263 |
- | 0.3291 | 6.8088 | 23000 | 0.3324 | 0.5159 | 0.3661 | 0.3953 | 0.6461 | 0.5937 | 0.9279 |
- | 0.3222 | 7.1048 | 24000 | 0.3245 | 0.5127 | 0.3517 | 0.3806 | 0.6506 | 0.5544 | 0.9276 |
- | 0.3292 | 7.4008 | 25000 | 0.3251 | 0.5130 | 0.3568 | 0.3867 | 0.6505 | 0.5573 | 0.9281 |
- | 0.32 | 7.6969 | 26000 | 0.3245 | 0.5117 | 0.3585 | 0.3888 | 0.6505 | 0.5614 | 0.9285 |
- | 0.3318 | 7.9929 | 27000 | 0.3243 | 0.5097 | 0.3504 | 0.3789 | 0.6507 | 0.5360 | 0.9276 |
- | 0.3305 | 8.2889 | 28000 | 0.3237 | 0.5109 | 0.3536 | 0.3832 | 0.6502 | 0.5494 | 0.9280 |
- | 0.3423 | 8.5850 | 29000 | 0.3314 | 0.4979 | 0.3425 | 0.3662 | 0.6464 | 0.4955 | 0.9263 |
- | 0.3212 | 8.8810 | 30000 | 0.3236 | 0.5155 | 0.3552 | 0.3846 | 0.6509 | 0.5628 | 0.9285 |
- | 0.3211 | 9.1770 | 31000 | 0.3231 | 0.5130 | 0.3581 | 0.3888 | 0.6510 | 0.5587 | 0.9283 |
- | 0.3362 | 9.4731 | 32000 | 0.3238 | 0.5080 | 0.3541 | 0.3836 | 0.6506 | 0.5315 | 0.9280 |
- | 0.3305 | 9.7691 | 33000 | 0.3261 | 0.5054 | 0.3471 | 0.3737 | 0.6498 | 0.5115 | 0.9277 |
- | 0.3185 | 10.0651 | 34000 | 0.3232 | 0.5152 | 0.3571 | 0.3872 | 0.6520 | 0.5640 | 0.9284 |
- | 0.3347 | 10.3612 | 35000 | 0.3255 | 0.5044 | 0.3511 | 0.3787 | 0.6505 | 0.5154 | 0.9277 |
- | 0.3293 | 10.6572 | 36000 | 0.3262 | 0.7152 | 0.3651 | 0.3969 | 0.6487 | 0.5816 | 0.9283 |
- | 0.3291 | 10.9532 | 37000 | 0.3256 | 0.5181 | 0.3615 | 0.3918 | 0.6497 | 0.5804 | 0.9281 |
- | 0.3221 | 11.2493 | 38000 | 0.3239 | 0.7123 | 0.3637 | 0.3959 | 0.6491 | 0.5714 | 0.9282 |
- | 0.3216 | 11.5453 | 39000 | 0.3299 | 0.5013 | 0.3475 | 0.3733 | 0.6481 | 0.4941 | 0.9269 |
- | 0.3248 | 11.8413 | 40000 | 0.3219 | 0.5122 | 0.3551 | 0.3854 | 0.6519 | 0.5367 | 0.9283 |
- | 0.3285 | 12.1374 | 41000 | 0.3232 | 0.5056 | 0.3540 | 0.3829 | 0.6516 | 0.5265 | 0.9278 |
- | 0.3243 | 12.4334 | 42000 | 0.3260 | 0.7169 | 0.3688 | 0.4009 | 0.6493 | 0.5867 | 0.9283 |
- | 0.3186 | 12.7294 | 43000 | 0.3220 | 0.7092 | 0.3603 | 0.3923 | 0.6513 | 0.5507 | 0.9282 |
- | 0.3316 | 13.0255 | 44000 | 0.3220 | 0.5121 | 0.3544 | 0.3844 | 0.6525 | 0.5347 | 0.9286 |
- | 0.3157 | 13.3215 | 45000 | 0.3217 | 0.5100 | 0.3602 | 0.3910 | 0.6528 | 0.5548 | 0.9285 |
- | 0.3211 | 13.6175 | 46000 | 0.3226 | 0.7178 | 0.3622 | 0.3940 | 0.6524 | 0.5755 | 0.9285 |
- | 0.3249 | 13.9136 | 47000 | 0.3235 | 0.7053 | 0.3576 | 0.3887 | 0.6516 | 0.5287 | 0.9281 |
- | 0.3226 | 14.2096 | 48000 | 0.3211 | 0.7134 | 0.3587 | 0.3907 | 0.6522 | 0.5586 | 0.9279 |
- | 0.326 | 14.5056 | 49000 | 0.3208 | 0.7141 | 0.3632 | 0.3958 | 0.6535 | 0.5641 | 0.9284 |
- | 0.3211 | 14.8017 | 50000 | 0.3293 | 0.5021 | 0.3460 | 0.3722 | 0.6483 | 0.4897 | 0.9271 |
- | 0.3232 | 15.0977 | 51000 | 0.3207 | 0.7174 | 0.3632 | 0.3968 | 0.6536 | 0.5650 | 0.9290 |
- | 0.3232 | 15.3937 | 52000 | 0.3200 | 0.5125 | 0.3592 | 0.3901 | 0.6548 | 0.5483 | 0.9291 |
- | 0.3248 | 15.6898 | 53000 | 0.3224 | 0.5108 | 0.3540 | 0.3835 | 0.6526 | 0.5195 | 0.9287 |
- | 0.3132 | 15.9858 | 54000 | 0.3216 | 0.5151 | 0.3634 | 0.3944 | 0.6528 | 0.5765 | 0.9287 |
- | 0.3235 | 16.2818 | 55000 | 0.3216 | 0.7181 | 0.3698 | 0.4042 | 0.6526 | 0.5777 | 0.9289 |
- | 0.3253 | 16.5779 | 56000 | 0.3230 | 0.5082 | 0.3527 | 0.3815 | 0.6523 | 0.5142 | 0.9283 |
- | 0.3185 | 16.8739 | 57000 | 0.3200 | 0.5145 | 0.3576 | 0.3884 | 0.6540 | 0.5569 | 0.9285 |
- | 0.3268 | 17.1699 | 58000 | 0.3201 | 0.7159 | 0.3691 | 0.4037 | 0.6538 | 0.5689 | 0.9291 |
- | 0.3191 | 17.4660 | 59000 | 0.3207 | 0.7187 | 0.3696 | 0.4042 | 0.6543 | 0.5763 | 0.9288 |
- | 0.318 | 17.7620 | 60000 | 0.3194 | 0.7146 | 0.3598 | 0.3922 | 0.6544 | 0.5493 | 0.9288 |
- | 0.3049 | 18.0580 | 61000 | 0.3196 | 0.7099 | 0.3601 | 0.3931 | 0.6536 | 0.5355 | 0.9287 |
- | 0.3298 | 18.3541 | 62000 | 0.3212 | 0.5084 | 0.3563 | 0.3864 | 0.6531 | 0.5300 | 0.9285 |
- | 0.3257 | 18.6501 | 63000 | 0.3216 | 0.7201 | 0.3682 | 0.4025 | 0.6528 | 0.5782 | 0.9285 |
- | 0.3277 | 18.9461 | 64000 | 0.3188 | 0.7140 | 0.3595 | 0.3920 | 0.6540 | 0.5413 | 0.9291 |
- | 0.3187 | 19.2422 | 65000 | 0.3189 | 0.7147 | 0.3654 | 0.3999 | 0.6540 | 0.5593 | 0.9287 |
- | 0.319 | 19.5382 | 66000 | 0.3204 | 0.5114 | 0.3550 | 0.3853 | 0.6534 | 0.5199 | 0.9291 |
- | 0.3125 | 19.8342 | 67000 | 0.3198 | 0.5149 | 0.3602 | 0.3914 | 0.6553 | 0.5636 | 0.9286 |
- | 0.3114 | 20.1303 | 68000 | 0.3185 | 0.5150 | 0.3590 | 0.3903 | 0.6550 | 0.5508 | 0.9289 |
- | 0.3163 | 20.4263 | 69000 | 0.3187 | 0.7171 | 0.3688 | 0.4036 | 0.6550 | 0.5685 | 0.9290 |
- | 0.3146 | 20.7223 | 70000 | 0.3184 | 0.7171 | 0.3673 | 0.4021 | 0.6556 | 0.5613 | 0.9293 |
- | 0.3223 | 21.0184 | 71000 | 0.3203 | 0.5083 | 0.3570 | 0.3869 | 0.6538 | 0.5281 | 0.9287 |
- | 0.3209 | 21.3144 | 72000 | 0.3187 | 0.7155 | 0.3700 | 0.4050 | 0.6551 | 0.5671 | 0.9290 |
- | 0.3111 | 21.6104 | 73000 | 0.3182 | 0.7131 | 0.3656 | 0.3998 | 0.6552 | 0.5537 | 0.9292 |
- | 0.3173 | 21.9065 | 74000 | 0.3187 | 0.7184 | 0.3690 | 0.4050 | 0.6547 | 0.5688 | 0.9290 |
- | 0.3304 | 22.2025 | 75000 | 0.3181 | 0.7117 | 0.3628 | 0.3966 | 0.6550 | 0.5463 | 0.9293 |
- | 0.3235 | 22.4985 | 76000 | 0.3212 | 0.7214 | 0.3728 | 0.4089 | 0.6542 | 0.5811 | 0.9286 |
- | 0.3196 | 22.7946 | 77000 | 0.3179 | 0.7138 | 0.3620 | 0.3959 | 0.6550 | 0.5459 | 0.9290 |
- | 0.3089 | 23.0906 | 78000 | 0.3193 | 0.7196 | 0.3730 | 0.4082 | 0.6553 | 0.5781 | 0.9292 |
- | 0.3129 | 23.3866 | 79000 | 0.3227 | 0.6800 | 0.3785 | 0.4156 | 0.6514 | 0.5868 | 0.9288 |
- | 0.3149 | 23.6827 | 80000 | 0.3178 | 0.7180 | 0.3658 | 0.4005 | 0.6561 | 0.5608 | 0.9290 |
- | 0.3164 | 23.9787 | 81000 | 0.3179 | 0.7176 | 0.3698 | 0.4060 | 0.6557 | 0.5660 | 0.9289 |
- | 0.3157 | 24.2747 | 82000 | 0.3195 | 0.7200 | 0.3726 | 0.4089 | 0.6551 | 0.5771 | 0.9290 |
- | 0.3144 | 24.5708 | 83000 | 0.3183 | 0.7130 | 0.3612 | 0.3951 | 0.6547 | 0.5369 | 0.9293 |
- | 0.3131 | 24.8668 | 84000 | 0.3179 | 0.7146 | 0.3610 | 0.3949 | 0.6553 | 0.5384 | 0.9295 |
- | 0.3087 | 25.1628 | 85000 | 0.3172 | 0.7169 | 0.3638 | 0.3982 | 0.6559 | 0.5540 | 0.9294 |
- | 0.3227 | 25.4589 | 86000 | 0.3177 | 0.7176 | 0.3733 | 0.4098 | 0.6558 | 0.5698 | 0.9292 |
- | 0.3202 | 25.7549 | 87000 | 0.3176 | 0.7184 | 0.3659 | 0.4008 | 0.6555 | 0.5586 | 0.9291 |
- | 0.3279 | 26.0509 | 88000 | 0.3176 | 0.7178 | 0.3706 | 0.4071 | 0.6557 | 0.5627 | 0.9293 |
- | 0.3212 | 26.3470 | 89000 | 0.3175 | 0.7179 | 0.3668 | 0.4016 | 0.6554 | 0.5638 | 0.9290 |
- | 0.3186 | 26.6430 | 90000 | 0.3172 | 0.7150 | 0.3652 | 0.3999 | 0.6559 | 0.5497 | 0.9294 |
- | 0.3186 | 26.9390 | 91000 | 0.3171 | 0.7163 | 0.3648 | 0.3996 | 0.6556 | 0.5496 | 0.9293 |
- | 0.3133 | 27.2351 | 92000 | 0.3185 | 0.7100 | 0.3618 | 0.3953 | 0.6549 | 0.5324 | 0.9293 |
- | 0.3148 | 27.5311 | 93000 | 0.3176 | 0.7187 | 0.3711 | 0.4075 | 0.6561 | 0.5679 | 0.9292 |
- | 0.3201 | 27.8271 | 94000 | 0.3170 | 0.7173 | 0.3681 | 0.4033 | 0.6558 | 0.5587 | 0.9293 |
- | 0.321 | 28.1231 | 95000 | 0.3173 | 0.7141 | 0.3654 | 0.4000 | 0.6556 | 0.5476 | 0.9292 |
- | 0.3169 | 28.4192 | 96000 | 0.3171 | 0.7177 | 0.3682 | 0.4034 | 0.6559 | 0.5597 | 0.9294 |
- | 0.3231 | 28.7152 | 97000 | 0.3169 | 0.7154 | 0.3651 | 0.3998 | 0.6556 | 0.5523 | 0.9293 |
- | 0.3181 | 29.0112 | 98000 | 0.3169 | 0.7164 | 0.3672 | 0.4022 | 0.6556 | 0.5572 | 0.9293 |
- | 0.3261 | 29.3073 | 99000 | 0.3173 | 0.7181 | 0.3700 | 0.4063 | 0.6560 | 0.5659 | 0.9291 |
- | 0.3181 | 29.6033 | 100000 | 0.3170 | 0.7177 | 0.3695 | 0.4058 | 0.6558 | 0.5615 | 0.9292 |
- | 0.3149 | 29.8993 | 101000 | 0.3169 | 0.7165 | 0.3667 | 0.4017 | 0.6556 | 0.5559 | 0.9293 |
-
-
- ### Framework versions
-
- - Transformers 4.43.4
- - Pytorch 2.4.0+cu121
- - Datasets 2.21.0
- - Tokenizers 0.19.1
 
  - recall
  - accuracy
  model-index:
+ - name: stack-edu-classifier-typescript
  results: []
+ language:
+ - code
+ library_name: transformers
  ---
 
+ # stack-edu-classifier-typescript
 
+ This is a classifier for scoring the educational value of code files in The Stack v2 dataset. It is a fine-tuned version of [bigcode/starencoder](https://huggingface.co/bigcode/starencoder) with a classification head, trained on code files annotated by Llama-3.1-70B-Instruct. We used this classifier to build the Stack-Edu dataset used for training SmolLM2; see the [paper](https://arxiv.org/pdf/2502.02737). Each classifier is trained on a single programming language.
+
+ ### How to use in transformers
+ To load the classifier, use the following code:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
+ model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
+
+ text = "This is a test sentence."
+ inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
+ outputs = model(**inputs)
+ logits = outputs.logits.squeeze(-1).float().detach().numpy()
+ score = logits.item()
+ result = {
+     "text": text,
+     "score": score,
+     "int_score": int(round(max(0, min(score, 5)))),
+ }
+
+ print(result)
+ # {'text': 'This is a test sentence.', 'score': 0.07964489609003067, 'int_score': 0}
+ ```
+
+ ## Intended uses & limitations
+
+ While the classifier performs well at distinguishing high-quality code in its target language (TypeScript in this case), there are some limitations:
+
+ - Scope: The model's performance might change on other datasets, in particular on out-of-distribution samples. The classifier's context window is 1024 tokens, which might not be sufficient to assess the quality of some long code files.
+ - Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for the annotation. Biases in both can affect the classifier's judgments, and it might overfit to thoroughly commented code.
+ - Context: The classifier evaluates individual code files without considering broader project context, which might limit its effectiveness in certain scenarios.
+
+ The training and inference code is available on [GitHub](https://github.com/huggingface/cosmopedia/tree/main/classification).
+
+ ## Training procedure
+
+ The classifier was trained on 500,000 pairs of code files and their scores from 0 to 5, generated by Llama-3.1-70B-Instruct. The samples were annotated for their educational quality, with 1 being not educational and 5 being highly educational and relevant for teaching programming. You can find the prompt used to build the annotations in the appendix of the [SmolLM2 paper](https://arxiv.org/pdf/2502.02737). For Markdown, we ask the LLM to judge both the structure and the educational quality of the text content.
+
+ We added a classification head with a single regression output to StarEncoder and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head.
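+
+ As an illustration, here is a minimal sketch (not our exact training script) of that setup in `transformers`: a single-output regression head on top of StarEncoder, with the embedding and encoder layers frozen.
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # Hypothetical sketch: StarEncoder with a single-output regression head.
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "bigcode/starencoder",
+     num_labels=1,                 # one regression output (the educational score)
+     problem_type="regression",
+ )
+
+ # Freeze the embedding and encoder layers so only the classification head is trained.
+ for name, param in model.named_parameters():
+     if "embeddings" in name or "encoder" in name:
+         param.requires_grad = False
+ ```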
 
 
  It achieves the following results on the evaluation set:
  - Loss: 0.3169
  - Precision: 0.7165
 
  - F1 Binary Minimum3: 0.5559
  - F1 Binary Minimum2: 0.9293
 
+ While the macro F1 scores across the 1-5 rating scale are relatively low, due to the model's difficulty in distinguishing between higher-rated samples, the classifier performs well for our primary filtering task. When converting to binary classification, a threshold of 2 yields F1 scores between 0.8 and 0.9 for most Stack-Edu classifiers, whereas a threshold of 3 yields F1 scores between 0.5 and 0.8, with the highest scores for Python, SQL, C, and Rust and the lowest for TypeScript and PHP.
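+
+ For illustration, here is a minimal sketch (with hypothetical score lists) of how such binary metrics can be computed from the 0-5 scores: a file counts as positive when its score reaches the threshold (3 for "F1 Binary Minimum3", 2 for "F1 Binary Minimum2").
+
+ ```python
+ from sklearn.metrics import f1_score
+
+ def binary_f1(true_scores, predicted_scores, threshold):
+     # Binarize both the annotated scores and the classifier predictions at the
+     # given threshold, then compute the standard binary F1.
+     y_true = [int(s >= threshold) for s in true_scores]
+     y_pred = [int(round(s) >= threshold) for s in predicted_scores]
+     return f1_score(y_true, y_pred)
+
+ # Hypothetical example values:
+ annotations = [4, 1, 3, 0, 5, 2]
+ predictions = [3.2, 1.4, 2.8, 0.3, 4.1, 2.6]
+ print(binary_f1(annotations, predictions, threshold=3))  # "F1 Binary Minimum3"
+ print(binary_f1(annotations, predictions, threshold=2))  # "F1 Binary Minimum2"
+ ```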
 
+ <div style="display: flex; justify-content: center; gap: 20px;">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/VYh1deFc8Jif7B4kDpndc.png" width="600">
+ </div>
+ We validated these classifiers by filtering Stack v2 data and testing the result on an intermediate SmolLM2 checkpoint. Filtering with a threshold of 3 improved performance across most languages while maintaining adequate data volume, though Java showed better results with a threshold of 2.
+ We didn't run an evaluation for Markdown, but based on manual inspection we found a threshold of 3 to be effective at filtering low-quality content.
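+
+ Below is a minimal, hypothetical sketch of how such threshold-based filtering can be applied to a code dataset with this classifier (the data files, column name, and batch size are placeholders, not our exact pipeline).
+
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ REPO_NAME = "..."  # repo id of this classifier on the Hub
+ tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
+ model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME).eval()
+
+ def score_batch(batch):
+     # Score a batch of code files with the classifier (truncated to its context length).
+     inputs = tokenizer(batch["content"], return_tensors="pt", padding="longest", truncation=True)
+     with torch.no_grad():
+         scores = model(**inputs).logits.squeeze(-1)
+     return {"edu_score": scores.tolist()}
+
+ ds = load_dataset("json", data_files="typescript_files.jsonl", split="train")  # placeholder data
+ ds = ds.map(score_batch, batched=True, batch_size=32)
+ ds = ds.filter(lambda x: x["edu_score"] >= 3)  # keep files scored at least 3
+ ```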
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
+ - learning_rate: 0.0003
+ - train_batch_size: 64
  - eval_batch_size: 256
  - seed: 0
  - distributed_type: multi-GPU
+ - num_devices: 2
  - total_train_batch_size: 128
+ - total_eval_batch_size: 512
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 200
+ - num_epochs: 20
+
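+ For reference, here is one hypothetical way to express the hyperparameters above as `transformers` `TrainingArguments` (the `output_dir` is a placeholder and the actual training script may differ).
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="stack-edu-classifier-typescript",  # placeholder
+     learning_rate=3e-4,
+     per_device_train_batch_size=64,   # 2 devices -> total train batch size 128
+     per_device_eval_batch_size=256,   # 2 devices -> total eval batch size 512
+     num_train_epochs=20,
+     lr_scheduler_type="linear",
+     warmup_steps=200,
+     seed=0,
+ )
+ ```
+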
+ ## License
+
+ [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Citation
+ ```bibtex
+ @misc{allal2025smollm2smolgoesbig,
+       title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
+       author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
+       year={2025},
+       eprint={2502.02737},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2502.02737},
+ }
+ ```