bclavie committed · Commit 2f741b5 · verified · 1 Parent(s): ab2766c

Update README.md

Files changed (1)
  1. README.md +142 -3
README.md CHANGED
@@ -1,8 +1,147 @@
  ---
  library_name: transformers
- tags: []
+ license: apache-2.0
+ language:
+ - en
+ widget:
+ - text: "You will be given a question and options. Select the right answer.
+     QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
+     CHOICES:
+     - A: commutative semi group
+     - B: abelian group
+     - C: non-abelian group
+     - D: None of these
+     ANSWER: [unused0] [MASK]"
+ tags:
+ - fill-mask
+ - masked-lm
+ - long-context
+ - classification
+ - modernbert
+ pipeline_tag: fill-mask
+ inference: false
  ---

- # Model Card for Model ID
+ # ModernBERT-Large-Instruct

- Coming Soon
+ ## Table of Contents
+ 1. [Model Summary](#model-summary)
+ 2. [Usage](#usage)
+ 3. [Evaluation](#evaluation)
+ 4. [Limitations](#limitations)
+ 5. [Training](#training)
+ 6. [License](#license)
+ 7. [Citation](#citation)
+
+ ## Model Summary
+
+ ModernBERT-Large-Instruct is a lightly instruction-tuned version of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large), trained with a mixed objective (Answer Token Prediction & dummy MLM) on 20M examples sampled from the FLAN collection.
+
+ Despite a very straightforward training and inference pipeline, it proves to be a strong model in a variety of contexts, in both zero-shot and fully fine-tuned settings.
+
+ For more details, we recommend checking out the [TIL Blog Post](), the [mini cookbook GitHub repository](https://github.com/AnswerDotAI/ModernBERT-Instruct-mini-cookbook) or the [Technical Report](https://arxiv.org/abs/2502.03793).
+
+ ## Usage
+
+ In order to use ModernBERT-Large-Instruct, you need to install a version of `transformers` that natively supports ModernBERT (4.48+):
+
+ ```sh
+ pip install -U "transformers>=4.48.0"
+ ```
+
+ **⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**
+
+ ```bash
+ pip install flash-attn
+ ```
+
+ All tasks are then performed using the model's masked language modelling head, loaded via `AutoModelForMaskedLM`. Here is an example of answering an MMLU question:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Load model and tokenizer
+ model_name = "answerdotai/ModernBERT-Large-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ if device == 'cuda':
+     # Flash Attention 2 only supports fp16/bf16, so load the model in bfloat16
+     model = AutoModelForMaskedLM.from_pretrained(
+         model_name, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16
+     )
+ else:
+     model = AutoModelForMaskedLM.from_pretrained(model_name)
+
+ model.to(device)
+ model.eval()
+
+ # Format input for classification or multiple choice. This is a random example from MMLU.
+ text = """You will be given a question and options. Select the right answer.
+ QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
+ CHOICES:
+ - A: commutative semi group
+ - B: abelian group
+ - C: non-abelian group
+ - D: None of these
+ ANSWER: [unused0] [MASK]"""
+
+ # Get prediction: take the most likely token at the [MASK] position
+ inputs = tokenizer(text, return_tensors="pt").to(device)
+ with torch.no_grad():
+     outputs = model(**inputs)
+ mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
+ pred_id = outputs.logits[0, mask_idx].argmax()
+ answer = tokenizer.decode(pred_id)
+ print(f"Predicted answer: {answer}")  # Outputs: B
+ ```
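+
+ For label-constrained tasks (classification, multiple choice), you may prefer to compare only the candidate answer tokens at the `[MASK]` position instead of taking an unconstrained argmax over the whole vocabulary. The snippet below is a minimal sketch of this idea, reusing `tokenizer`, `outputs` and `mask_idx` from the example above; the exact tokenization of the answer letters (with or without a leading space) is an assumption you should verify for your own templates.
+
+ ```python
+ # Minimal sketch: score only the candidate answer letters at the [MASK] position.
+ def letter_token_ids(letter):
+     # Collect single-token encodings of the letter, with and without a leading
+     # space, since the exact tokenization depends on the template.
+     ids = []
+     for form in (letter, f" {letter}"):
+         enc = tokenizer.encode(form, add_special_tokens=False)
+         if len(enc) == 1:
+             ids.append(enc[0])
+     return ids  # assumed non-empty for single capital letters
+
+ mask_logits = outputs.logits[0, mask_idx]
+ scores = {
+     letter: max(mask_logits[i].item() for i in letter_token_ids(letter))
+     for letter in "ABCD"
+ }
+ print("Constrained answer:", max(scores, key=scores.get))  # Should also print B here
+ ```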
+
+ ## Evaluation
+
+ Results are taken from the [technical report](https://arxiv.org/abs/2502.03793). Results for MMLU and MMLU-Pro are taken from [SmolLM2 (†)](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and the [MMLU-Pro leaderboard (‡)](https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro) whenever possible.
+
+ ### Zero-Shot
+
+ | Model                     | MMLU       | MMLU-Pro    | ADEv2     | NIS       | OSE       | Average   |
+ |---------------------------|------------|-------------|-----------|-----------|-----------|-----------|
+ | **0.3-0.5B**              |            |             |           |           |           |           |
+ | Tasksource-NLI            | 36.08      | 16.54       | _65.17_   | 58.72     | 21.11     | _39.52_   |
+ | RoBERTa-Large-SST         | 31.30      | 13.63       | 43.61     | 75.00     | **40.67** | 40.84     |
+ | UniMC                     | 38.48      | **18.83**   | 23.29     | 73.96     | 36.88     | 38.29     |
+ | ModernBERT-Large-Instruct | **43.06**  | 17.16       | **53.31** | **85.53** | 20.62     | **43.94** |
+ | SmolLM2-360M              | 35.8†      | 11.38‡      | -         | -         | -         | -         |
+ | Qwen2.5-0.5B              | 33.7†      | 15.93‡      | -         | -         | -         | -         |
+ | **1B+**                   |            |             |           |           |           |           |
+ | Llama3.2-1B               | 45.83      | 22.6        | -         | -         | -         | -         |
+ | SmolLM2-1.7B              | 48.44      | 18.31‡      | -         | -         | -         | -         |
+ | Qwen2.5-1.5B              | ***59.67***| ***32.1‡*** | -         | -         | -         | -         |
+
+ ### Fine-Tuned
+
+ | Model                      | MNLI      | Yahoo!    | 20ng      | AGNews    | SST-2     | IMDB     | SST-5     | Average   |
+ |----------------------------|-----------|-----------|-----------|-----------|-----------|----------|-----------|-----------|
+ | ModernBERT (cls head)      | 90.8†     | 77.75     | **73.96** | **95.34** | **97.1†** | 96.52    | 59.28     | 84.39     |
+ | ModernBERT-Large-Instruct  | **91.03** | **77.88** | **73.96** | 95.24     | 96.22     | **97.2** | **61.13** | **84.67** |
+
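+ For fine-tuning with the same MLM-head formulation (as opposed to the classification head used in the first row above), the answer token can be supervised directly at the `[MASK]` position. The sketch below is a heavily simplified illustration, not the exact setup behind the reported results; the template, verbalizer, toy data and hyperparameters are all assumptions.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ model_name = "answerdotai/ModernBERT-Large-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForMaskedLM.from_pretrained(model_name)
+
+ # Hypothetical binary sentiment template -- adapt the instruction, choices and
+ # verbalizer to your own task.
+ TEMPLATE = (
+     "You will be given a sentence and options. Select the right answer.\n"
+     "SENTENCE: {sentence}\n"
+     "CHOICES:\n"
+     "- A: negative\n"
+     "- B: positive\n"
+     "ANSWER: [unused0] [MASK]"
+ )
+
+ def build_batch(examples):
+     # `examples` is a list of (sentence, answer_letter) pairs -- an illustrative format.
+     texts = [TEMPLATE.format(sentence=s) for s, _ in examples]
+     enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+     # Supervise only the [MASK] position; -100 is ignored by the loss.
+     labels = torch.full_like(enc["input_ids"], -100)
+     answer_ids = torch.tensor(
+         # Assumes each " X" verbalizer encodes to a single token -- verify for your tokenizer.
+         [tokenizer.encode(f" {a}", add_special_tokens=False)[0] for _, a in examples]
+     )
+     labels[enc["input_ids"] == tokenizer.mask_token_id] = answer_ids
+     return enc, labels
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
+ model.train()
+ enc, labels = build_batch([("a wonderful film", "B"), ("a dull, lifeless mess", "A")])
+ loss = model(**enc, labels=labels).loss  # cross-entropy at the [MASK] positions only
+ loss.backward()
+ optimizer.step()
+ ```
+
+ At inference time, predictions are then read off the `[MASK]` position exactly as in the usage example above.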
+
+ ## Limitations
+
+ ModernBERT’s training data is primarily English and code, so performance is best on English text and code. ModernBERT-Large-Instruct is a first version, demonstrating the strong potential of using the MLM head for downstream tasks without complex pipelines. However, it is likely to have failure cases and could be improved further.
+
+ ## License
+
+ Apache 2.0
+
+ ## Citation
+
+ If you use ModernBERT-Large-Instruct in your work, please cite:
+
+ ```bibtex
+ @misc{clavié2025itsmasksimpleinstructiontuning,
+       title={It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers},
+       author={Benjamin Clavié and Nathan Cooper and Benjamin Warner},
+       year={2025},
+       eprint={2502.03793},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2502.03793},
+ }
+ ```