This is a version of ModernBERT-base distilled down to 16 layers out of 22. The last 6 local attention layers were removed:

0. Global
1. Local
2. Local
3. Global
4. Local
5. Local
6. Global
7. Local
8. Local
9. Global
10. Local
11. Local
12. Global
13. Local (REMOVED)
14. Local (REMOVED)
15. Global
16. Local (REMOVED)
17. Local (REMOVED)
18. Global
19. Local (REMOVED)
20. Local (REMOVED)
21. Global

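This striping isn't arbitrary: per the ModernBERT config, every third layer (`global_attn_every_n_layers = 3`) uses global attention. Here's a quick sketch that re-derives the table above from the config (the `removed` set is just this repo's pruning choice):

```python
# Re-derive the global/local striping from the config instead of
# hard-coding it. For ModernBERT-base, global_attn_every_n_layers == 3.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
removed = {13, 14, 16, 17, 19, 20}  # layers pruned in this repo

for i in range(config.num_hidden_layers):
    kind = "Global" if i % config.global_attn_every_n_layers == 0 else "Local"
    print(f"{i}. {kind}" + (" (REMOVED)" if i in removed else ""))
```
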
Unfortunately, the modeling code relies on the global/local attention pattern repeating uniformly throughout the model, so loading this bad boy takes a bit of model surgery. I hope the HuggingFace team will eventually update the model configuration to allow custom striping of global and local layers. For now, here's how to do it:

1. Download the checkpoint (`model.pt`) from this repository.
2. Initialize ModernBERT-base:
```python
import torch  # needed for torch.load in step 4
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```
3. Remove the layers:
```python
# Drop the six pruned local-attention layers by their original indices.
layers_to_remove = [13, 14, 16, 17, 19, 20]
model.model.layers = nn.ModuleList([
    layer for idx, layer in enumerate(model.model.layers)
    if idx not in layers_to_remove
])
```
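Optionally, sanity-check that the 16 surviving layers kept their attention types. This assumes the current `transformers` ModernBERT internals, where each attention module stores a `local_attention` tuple that is `(-1, -1)` for global layers; if that attribute differs in your version, adapt accordingly:
```python
# Expected: Global at original positions 0, 3, 6, 9, 12, 15, 18, 21,
# i.e. G L L G L L G L L G L L G G G G after pruning.
for idx, layer in enumerate(model.model.layers):
    kind = "Global" if layer.attn.local_attention == (-1, -1) else "Local"
    print(idx, kind)
```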
4. Load the checkpoint state dict:
```python
# Load the distilled weights into the pruned backbone.
state_dict = torch.load("model.pt", map_location="cpu")
model.model.load_state_dict(state_dict)
```

5. Use the model! Yay!

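As a quick smoke test, you can run the same fill-mask probe mentioned in the training notes below (ModernBERT's tokenizer uses a literal `[MASK]` token):

```python
# Quick fill-mask smoke test (see Training Information below).
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the top prediction.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # " Paris", hopefully
```
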
# Training Information

This model was distilled from ModernBERT-base on the MiniPile dataset (`JeanKaddour/minipile`), which includes English and code data. Distillation used all 1M samples in the dataset for 1 epoch, with MSE loss on the logits and a batch size of 16. The embeddings and LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
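
Roughly, the training step looked like this (a minimal sketch only: the learning rate, batch construction, and freezing-by-name trick here are illustrative assumptions, not the exact training script):

```python
# Minimal sketch of the distillation step: MSE between student and teacher
# logits, with only the transformer blocks trainable. The lr and the batch
# below are illustrative placeholders.
import torch.nn.functional as F

teacher = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
teacher.eval()
student = model  # the pruned 16-layer model from the steps above

# Freeze everything outside the transformer blocks (embeddings, LM head).
for name, p in student.named_parameters():
    p.requires_grad = name.startswith("model.layers")

optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1e-4
)

def distill_step(batch):
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    loss = F.mse_loss(student(**batch).logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

batch = tokenizer("Distillation is just regression.", return_tensors="pt")
print(distill_step(batch))
```
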
I have not yet evaluated this model. However, immediately after the initial model surgery it failed to correctly complete "The capital of France is [MASK]", while after training it correctly says "Paris", so something good happened!