This is a version of ModernBERT-base distilled down to 16 layers from the original 22. The last six local-attention layers were removed (a sketch of the striping rule follows the list):
0. Global
1. Local
2. Local
3. Global
4. Local
5. Local
6. Global
7. Local
8. Local
9. Global
10. Local
11. Local
12. Global
13. Local (REMOVED)
14. Local (REMOVED)
15. Global
16. Local (REMOVED)
17. Local (REMOVED)
18. Global
19. Local (REMOVED)
20. Local (REMOVED)
21. Global

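For context, the stock global/local striping can be reproduced from the model config. The sketch below assumes the `ModernBertConfig` fields `num_hidden_layers` and `global_attn_every_n_layers` (3 for ModernBERT-base), with a layer using global attention whenever its index is a multiple of that value:

```python
# Sketch: reproduce the global/local striping above from the stock config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
removed = {13, 14, 16, 17, 19, 20}

for layer_id in range(config.num_hidden_layers):
    # Global attention every `global_attn_every_n_layers` layers, local otherwise.
    kind = "Global" if layer_id % config.global_attn_every_n_layers == 0 else "Local"
    note = " (REMOVED)" if layer_id in removed else ""
    print(f"{layer_id}. {kind}{note}")
```
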
Unfortunately, the modeling code assumes the global/local attention pattern repeats uniformly throughout the model, so loading this bad boy takes a bit of model surgery. I hope the Hugging Face team will eventually update the model configuration to allow custom striping of global and local layers. For now, here's how to do it:

1. Download the checkpoint (model.pt) from this repository.
2. Initialize ModernBERT-base:
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the stock 22-layer ModernBERT-base.
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```
3. Remove the layers:
```python
# Drop the six local-attention layers listed above, keeping the remaining layers in order.
layers_to_remove = [13, 14, 16, 17, 19, 20]
model.model.layers = nn.ModuleList([
    layer for idx, layer in enumerate(model.model.layers)
    if idx not in layers_to_remove
])
```
4. Load the checkpoint state dict:
```python
# The checkpoint holds weights for the 16-layer backbone (model.model), not the full MaskedLM wrapper.
state_dict = torch.load("model.pt")
model.model.load_state_dict(state_dict)
```

5. Use the model! Yay! (A quick fill-mask sanity check is sketched below.)

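Here is a minimal fill-mask sanity check you can run with the loaded model. This is just a sketch: the prompt and greedy decoding are my choices, and it reuses the `tokenizer` and `model` built in the steps above.

```python
import torch

# Fill-mask sanity check: predict the token at the [MASK] position.
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the highest-scoring token.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # hopefully "Paris"
```
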
# Training Information
This model was distilled from ModernBERT-base on the MiniPile dataset (JeanKaddour/minipile), which includes English and code data.
Distillation used all 1M samples in the dataset for 1 epoch, with MSE loss on the logits and a batch size of 16.
The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
I have not yet evaluated this model. However, after the initial model surgery it failed to correctly complete
"The capital of France is [MASK]", and after training it correctly says "Paris", so something good happened!