This is a version of [ModernBERT-base](https://huggingface.co./answerdotai/ModernBERT-base) distilled down to 16 layers out of 22. This reduces the number of parameters from 149M to 119M; however, practically speaking, since the embedding params do not contribute greatly to latency, the effect is reducing the "trunk" of the model from 110M params to 80M params. I would expect this to reduce latency by roughly 25% (increasing throughput by roughly 33%). The last 6 local attention layers were removed:

0. Global
1. Local
2. Local
3. Global
4. Local
5. Local
6. Global
7. Local
8. Local
9. Global
10. Local
11. Local
12. Global
13. Local (REMOVED)
14. Local (REMOVED)
15. Global
16. Local (REMOVED)
17. Local (REMOVED)
18. Global
19. Local (REMOVED)
20. Local (REMOVED)
21. Global

Unfortunately the HuggingFace modeling code for ModernBERT relies on global-local attention patterns being uniform throughout the model, so loading this bad boy properly takes a bit of model surgery. I hope in the future that the HuggingFace team will update this model configuration to allow custom striping of global+local layers. For now, here's how to do it:

1. Download the checkpoint (model.pt) from this repository.

2. Initialize ModernBERT-base:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```

3. Remove the layers:

```python
# Original indices of the six local-attention layers that were dropped.
layers_to_remove = [13, 14, 16, 17, 19, 20]
model.model.layers = nn.ModuleList([
    layer for idx, layer in enumerate(model.model.layers)
    if idx not in layers_to_remove
])
```

4. Load the checkpoint state dict:

```python
# map_location="cpu" lets the checkpoint load on machines without a GPU.
state_dict = torch.load("model.pt", map_location="cpu")
model.model.load_state_dict(state_dict)
```

5. Use the model! Yay!
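To see why a uniform global/local setting cannot describe the pruned model, here is a minimal pure-Python sketch that rebuilds the layer list above and maps each new layer index back to its original one. The names `layer_kind`, `kept`, `new_to_old`, `pattern`, and `uniform` are illustrative only, and the "every third layer is global" rule is an assumption read off the list above, not something taken from the modeling code.

```python
NUM_LAYERS = 22
GLOBAL_EVERY = 3  # assumption: matches the list above (layers 0, 3, 6, ... are global)
LAYERS_TO_REMOVE = [13, 14, 16, 17, 19, 20]


def layer_kind(idx: int) -> str:
    """Attention type of a layer at the given index under a uniform stride."""
    return "global" if idx % GLOBAL_EVERY == 0 else "local"


# Original indices that survive the surgery, in order.
kept = [i for i in range(NUM_LAYERS) if i not in LAYERS_TO_REMOVE]

# New position in the pruned stack -> original position in ModernBERT-base.
new_to_old = dict(enumerate(kept))

# Attention type each surviving layer actually has (inherited from its old index).
pattern = [layer_kind(old) for old in kept]

# What a uniform stride would assign to a 16-layer model instead.
uniform = [layer_kind(new) for new in range(len(kept))]

for new, old in new_to_old.items():
    note = "" if pattern[new] == uniform[new] else "  <- differs from uniform stride"
    print(f"new {new:2d} <- old {old:2d}: {pattern[new]}{note}")
```

The pruned stack ends with four global layers in a row (new indices 12 to 15), while a uniform stride would expect local attention at new indices 13 and 14. That mismatch is why the steps above prune an already-initialized 22-layer model instead of building a 16-layer one directly.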
# Training Information

This model was distilled from ModernBERT-base on the [MiniPile dataset](https://huggingface.co./datasets/JeanKaddour/minipile), which includes English and code data. Distillation used all 1M samples in this dataset for 1 epoch, MSE loss on the logits, batch size of 16, AdamW optimizer, and constant learning rate of 1.0e-5. The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.

I have not yet evaluated this model. However, after the initial model surgery, it failed to correctly complete "The capital of France is [MASK]", and after training, it correctly says "Paris", so something good happened!
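The recipe above can be sketched as a single training step. This is a toy illustration, not the actual training script: `ToyLM`, the tiny vocabulary and hidden size, and the random batch are all hypothetical stand-ins, and initializing the student from the teacher's surviving blocks is an assumption based on the surgery described earlier. Only the ingredients named above are taken from the text (MSE loss on the logits, AdamW, constant learning rate of 1.0e-5, batch size 16, frozen shared embeddings and head).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, HIDDEN = 100, 32  # toy sizes, not ModernBERT's


class ToyLM(nn.Module):
    """Hypothetical stand-in: embedding -> residual blocks -> LM head."""

    def __init__(self, embed: nn.Embedding, head: nn.Linear, num_blocks: int):
        super().__init__()
        self.embed = embed
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.GELU()) for _ in range(num_blocks)]
        )
        self.head = head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(input_ids)
        for block in self.blocks:
            h = h + block(h)
        return self.head(h)


# Embeddings and LM head are shared between teacher and student, and frozen.
embed = nn.Embedding(VOCAB, HIDDEN)
head = nn.Linear(HIDDEN, VOCAB)
for p in list(embed.parameters()) + list(head.parameters()):
    p.requires_grad_(False)

teacher = ToyLM(embed, head, num_blocks=22).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Assumption: the student starts from the teacher's surviving blocks.
removed = {13, 14, 16, 17, 19, 20}
kept = [i for i in range(22) if i not in removed]
student = ToyLM(embed, head, num_blocks=len(kept))
for new_idx, old_idx in enumerate(kept):
    student.blocks[new_idx].load_state_dict(teacher.blocks[old_idx].state_dict())

# Only the student's blocks are optimized; the learning rate stays constant.
optimizer = torch.optim.AdamW(student.blocks.parameters(), lr=1.0e-5)

input_ids = torch.randint(0, VOCAB, (16, 8))  # batch of 16 random toy sequences
with torch.no_grad():
    teacher_logits = teacher(input_ids)
student_logits = student(input_ids)

# Distillation objective: mean squared error between student and teacher logits.
loss = nn.functional.mse_loss(student_logits, teacher_logits)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

Because the embeddings and head are the same frozen modules in both models, gradients only reach the student's blocks, which matches the description of training just the transformer blocks.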