This is a version of ModernBERT-base distilled down to 16 layers from the original 22. The last six local-attention layers were removed (a sketch of the striping rule follows the list):
0. Global
1. Local
2. Local
3. Global
4. Local
5. Local
6. Global
7. Local
8. Local
9. Global
10. Local
11. Local
12. Global
13. Local (REMOVED)
14. Local (REMOVED)
15. Global
16. Local (REMOVED)
17. Local (REMOVED)
18. Global
19. Local (REMOVED)
20. Local (REMOVED)
21. Global

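For context, the stock global/local striping can be reproduced from the model config. The sketch below assumes the `ModernBertConfig` fields `num_hidden_layers` and `global_attn_every_n_layers` (3 for ModernBERT-base), with a layer using global attention whenever its index is a multiple of that value:

```python
# Sketch: reproduce the global/local striping above from the stock config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
removed = {13, 14, 16, 17, 19, 20}

for layer_id in range(config.num_hidden_layers):
    # Global attention every `global_attn_every_n_layers` layers, local otherwise.
    kind = "Global" if layer_id % config.global_attn_every_n_layers == 0 else "Local"
    note = " (REMOVED)" if layer_id in removed else ""
    print(f"{layer_id}. {kind}{note}")
```
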
Unfortunately, the modeling code assumes the global/local attention pattern repeats uniformly throughout the model, so loading this bad boy takes a bit of model surgery. I hope the Hugging Face team will eventually update the model configuration to allow custom striping of global and local layers. For now, here's how to do it:

1. Download the checkpoint (model.pt) from this repository.
2. Initialize ModernBERT-base:
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the stock 22-layer ModernBERT-base.
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```
3. Remove the layers:
```python
# Drop the six local-attention layers listed above, keeping the remaining layers in order.
layers_to_remove = [13, 14, 16, 17, 19, 20]
model.model.layers = nn.ModuleList([
    layer for idx, layer in enumerate(model.model.layers)
    if idx not in layers_to_remove
])
```
4. Load the checkpoint state dict:
```python
# The checkpoint holds weights for the 16-layer backbone (model.model), not the full MaskedLM wrapper.
state_dict = torch.load("model.pt")
model.model.load_state_dict(state_dict)
```

5. Use the model! Yay! (A quick fill-mask sanity check is sketched below.)

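Here is a minimal fill-mask sanity check you can run with the loaded model. This is just a sketch: the prompt and greedy decoding are my choices, and it reuses the `tokenizer` and `model` built in the steps above.

```python
import torch

# Fill-mask sanity check: predict the token at the [MASK] position.
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the highest-scoring token.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # hopefully "Paris"
```
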
# Training Information
This model was distilled from ModernBERT-base on the MiniPile dataset (JeanKaddour/minipile), which includes English and code data.
Distillation used all 1M samples in the dataset for 1 epoch, with MSE loss on the logits and a batch size of 16.
The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
I have not yet evaluated this model. However, after the initial model surgery it failed to correctly complete
"The capital of France is [MASK]", and after training it correctly says "Paris", so something good happened!