OlivierDehaene committed
Commit
9b6553f
1 Parent(s): 157b030
README.md ADDED
@@ -0,0 +1,292 @@
---
license: openrail
datasets:
- bigcode/the-stack
language:
- code
programming_language:
- Java
- JavaScript
- Python
pipeline_tag: text-generation
inference: false
widget:
- text: 'def print_hello_world():'
  example_title: Hello world
  group: Python

model-index:
- name: SantaCoder
  results:
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.18
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.29
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.49
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.35
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.58
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.77
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (JavaScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.16
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.27
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.47
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Javascript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.28
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.51
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.70
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.15
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.26
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.41
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.28
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.44
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.59
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval FIM (Python)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.44
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval FIM (Java)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.62
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval FIM (JavaScript)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.60
      verified: false
  - task:
      type: text-generation
    dataset:
      type: code_x_glue_ct_code_to_text
      name: CodeXGLUE code-to-text (Python)
    metrics:
    - name: BLEU
      type: bleu
      value: 18.13
      verified: false
---

# SantaCoder

![banner](https://huggingface.co/datasets/bigcode/admin/resolve/main/banner.png)

Play with the model on the [SantaCoder Space Demo](https://huggingface.co/spaces/bigcode/santacoder-demo).

# Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

# Model Summary

The SantaCoder models are a series of 1.1B parameter models trained on the Python, Java, and JavaScript subset of [The Stack (v1.1)](https://huggingface.co/datasets/bigcode/the-stack) (which excluded opt-out requests).
The main model uses [Multi Query Attention](https://arxiv.org/abs/1911.02150), was trained with the [Fill-in-the-Middle objective](https://arxiv.org/abs/2207.14255), and used near-deduplication and comment-to-code ratio as filtering criteria.
In addition, there are several models that were trained on datasets with different filter parameters and with architecture and objective variations.

- **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
- **Paper:** [🎅SantaCoder: Don't reach for the stars!🌟](https://t.co/YV3pzUbYOr)
- **Point of Contact:** [[email protected]](mailto:[email protected])
- **Languages:** Python, Java, and JavaScript

| Model | Architecture | Objective | Filtering |
|:-|:-|:-|:-|
| `mha` | MHA | AR + FIM | Base |
| `no-fim` | MQA | AR | Base |
| `fim` | MQA | AR + FIM | Base |
| `stars` | MQA | AR + FIM | GitHub stars |
| `fertility` | MQA | AR + FIM | Tokenizer fertility |
| `comments` | MQA | AR + FIM | Comment-to-code ratio |
| `dedup-alt` | MQA | AR + FIM | Stronger near-deduplication |
| `final` | MQA | AR + FIM | Stronger near-deduplication and comment-to-code ratio |

The `final` model is the best-performing model and was trained twice as long (236B tokens) as the others. This checkpoint is the default model and is available on the `main` branch. All other checkpoints are on separate branches with corresponding names.
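
The pass@k scores in the metadata above follow the usual MultiPL-E / HumanEval protocol, which typically reports the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch of that estimator (the exact evaluation harness is an assumption, not part of this card):

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of which pass the unit tests; the card values are averages over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=200, c=20, k=10), 2))  # e.g. 0.66 when 20 of 200 samples pass
```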

# Use

## Intended use

The model was trained on GitHub code. As such it is _not_ an instruction model, and commands like "Write a function that computes the square root." do not work well.
You should phrase prompts the way they would occur in source code, e.g. as comments (`# the following function computes the sqrt`), or write a function signature and docstring and let the model complete the function body.
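
For example (illustrative prompts of our own; they can be fed to the generation snippet in the next section):

```python
# Comment-style prompt: describe the task the way a developer would in-file.
comment_prompt = "# the following function computes the sqrt of x with Newton's method\n"

# Signature-plus-docstring prompt: let the model complete the body.
docstring_prompt = (
    "def is_palindrome(s: str) -> bool:\n"
    '    """Return True if `s` reads the same forwards and backwards."""\n'
)
```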

**Feel free to share your generations in the Community tab!**

## How to use

### Generation
```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
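
By default `generate` stops after a handful of greedily decoded tokens; a sketch with a larger budget and sampling enabled (the specific values are illustrative, not tuned recommendations):

```python
outputs = model.generate(
    inputs,
    max_new_tokens=128,  # allow a longer completion than the default budget
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0]))
```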

### Fill-in-the-middle
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix parts of the input and output:

```python
input_text = "<fim-prefix>def print_hello_world():\n    <fim-suffix>\n    print('Hello world!')<fim-middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
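
A small helper that wraps the special tokens and returns only the infilled middle can make this easier to reuse. This is a sketch of ours built on the tokens above; the helper name, token budget, and end-of-text marker are assumptions, not part of the model's API:

```python
def fill_in_the_middle(prefix: str, suffix: str, max_new_tokens: int = 64) -> str:
    """Generate the middle segment between `prefix` and `suffix` (sketch)."""
    prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    completion = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Everything after <fim-middle> is the generated middle; cut at the
    # end-of-text marker if one appears (marker string assumed).
    middle = completion.split("<fim-middle>", 1)[-1]
    return middle.split("<|endoftext|>", 1)[0]

print(fill_in_the_middle("def print_hello_world():\n    ", "\n    print('Hello world!')"))
```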

### Load other checkpoints
We upload the checkpoint of each experiment to a separate branch, as well as the intermediate checkpoints as commits on those branches. You can load them with the `revision` flag:

```python
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder",
    revision="no-fim",  # name of branch or commit hash
    trust_remote_code=True
)
```
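
To see which experiment branches exist without leaving Python, the `huggingface_hub` client can list the repo's refs (a sketch; assumes `huggingface_hub` is installed):

```python
from huggingface_hub import list_repo_refs

refs = list_repo_refs("bigcode/santacoder")
print([branch.name for branch in refs.branches])  # e.g. ['main', 'no-fim', ...]
```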

### Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or impose other specific requirements that must be respected. We provide a [search index](https://huggingface.co/spaces/bigcode/santacoder-search) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.

# Limitations

The model has been trained on source code in Python, Java, and JavaScript. The predominant natural language in the source code is English, although other languages are also present. As such, the model can generate code snippets provided some context, but the generated code is not guaranteed to work as intended. It can be inefficient and contain bugs or exploits.

# Training

## Model

- **Architecture:** GPT-2 model with multi-query attention and Fill-in-the-Middle objective
- **Pretraining steps:** 600K
- **Pretraining tokens:** 236 billion
- **Precision:** float16
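
Since the weights were trained in float16, they can also be loaded in half precision to roughly halve memory use at inference time. A minimal sketch using standard `transformers` arguments (the dtype choice here is ours, not an official recommendation):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # load the weights in half precision
).to("cuda")
```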

## Hardware

- **GPUs:** 96 Tesla V100
- **Training time:** 6.2 days
- **Total FLOPs:** 2.1 × 10²¹

## Software

- **Orchestration:** [Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
- **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)

# License
The model is licensed under the CodeML Open RAIL-M v0.1 license. You can find the full license [here](https://huggingface.co/spaces/bigcode/license).

# Citation
**TODO**
config.json ADDED
@@ -0,0 +1,37 @@
{
  "_name_or_path": "bigcode/santacoder",
  "activation_function": "gelu_fast",
  "architectures": [
    "GPT2LMHeadCustomModel"
  ],
  "attention_head_type": "multiquery",
  "attn_pdrop": 0.1,
  "auto_map": {
    "AutoConfig": "configuration_gpt2_mq.GPT2CustomConfig",
    "AutoModelForCausalLM": "modeling_gpt2_mq.GPT2LMHeadCustomModel"
  },
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 2048,
  "n_head": 16,
  "n_inner": 8192,
  "n_layer": 24,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 49280
}
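
The `auto_map` entries above are what let `AutoConfig` and `AutoModelForCausalLM` resolve to the custom classes in this repo when `trust_remote_code=True` is passed. A quick way to check that resolution (a sketch; the printed values come from the config above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/santacoder", trust_remote_code=True)
print(type(config).__name__)       # GPT2CustomConfig, defined in configuration_gpt2_mq.py
print(config.attention_head_type)  # "multiquery"
print(config.n_embd, config.n_layer, config.n_head)  # 2048 24 16
```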
configuration_gpt2_mq.py ADDED
@@ -0,0 +1,201 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The OpenAI Team Authors and Hugging Face Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ Custom GPT-2 configuration"""
17
+ from collections import OrderedDict
18
+ from typing import Any, List, Mapping, Optional
19
+ from enum import Enum
20
+
21
+ from transformers import PreTrainedTokenizer, TensorType, is_torch_available
22
+
23
+ from transformers.configuration_utils import PretrainedConfig
24
+ from transformers.onnx import OnnxConfigWithPast, PatchingSpec
25
+ from transformers.utils import logging
26
+
27
+
28
+ logger = logging.get_logger(__name__)
29
+
30
+ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
31
+ "gpt2": "https://huggingface.co/gpt2/resolve/main/config.json",
32
+ "gpt2-medium": "https://huggingface.co/gpt2-medium/resolve/main/config.json",
33
+ "gpt2-large": "https://huggingface.co/gpt2-large/resolve/main/config.json",
34
+ "gpt2-xl": "https://huggingface.co/gpt2-xl/resolve/main/config.json",
35
+ "distilgpt2": "https://huggingface.co/distilgpt2/resolve/main/config.json",
36
+ }
37
+
38
+ MULTI_HEAD = "multihead"
39
+ MULTI_QUERY = "multiquery"
40
+
41
+
42
+ class GPT2CustomConfig(PretrainedConfig):
43
+ """
44
+ This is the configuration class to store the configuration of a [`GPT2Model`] or a [`TFGPT2Model`]. It is used to
45
+ instantiate a GPT-2 model according to the specified arguments, defining the model architecture. Instantiating a
46
+ configuration with the defaults will yield a similar configuration to that of the GPT-2
47
+ [gpt2](https://huggingface.co/gpt2) architecture.
48
+
49
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
50
+ documentation from [`PretrainedConfig`] for more information.
51
+
52
+
53
+ Args:
54
+ vocab_size (`int`, *optional*, defaults to 50257):
55
+ Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
56
+ `inputs_ids` passed when calling [`GPT2Model`] or [`TFGPT2Model`].
57
+ n_positions (`int`, *optional*, defaults to 1024):
58
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
59
+ just in case (e.g., 512 or 1024 or 2048).
60
+ n_embd (`int`, *optional*, defaults to 768):
61
+ Dimensionality of the embeddings and hidden states.
62
+ n_layer (`int`, *optional*, defaults to 12):
63
+ Number of hidden layers in the Transformer encoder.
64
+ n_head (`int`, *optional*, defaults to 12):
65
+ Number of attention heads for each attention layer in the Transformer encoder.
66
+ n_inner (`int`, *optional*, defaults to None):
67
+ Dimensionality of the inner feed-forward layers. `None` will set it to 4 times n_embd
68
+ activation_function (`str`, *optional*, defaults to `"gelu"`):
69
+ Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
70
+ resid_pdrop (`float`, *optional*, defaults to 0.1):
71
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
72
+ embd_pdrop (`int`, *optional*, defaults to 0.1):
73
+ The dropout ratio for the embeddings.
74
+ attn_pdrop (`float`, *optional*, defaults to 0.1):
75
+ The dropout ratio for the attention.
76
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
77
+ The epsilon to use in the layer normalization layers.
78
+ initializer_range (`float`, *optional*, defaults to 0.02):
79
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
80
+ summary_type (`string`, *optional*, defaults to `"cls_index"`):
81
+ Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and
82
+ [`TFGPT2DoubleHeadsModel`].
83
+
84
+ Has to be one of the following options:
85
+
86
+ - `"last"`: Take the last token hidden state (like XLNet).
87
+ - `"first"`: Take the first token hidden state (like BERT).
88
+ - `"mean"`: Take the mean of all tokens hidden states.
89
+ - `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
90
+ - `"attn"`: Not implemented now, use multi-head attention.
91
+ summary_use_proj (`bool`, *optional*, defaults to `True`):
92
+ Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and
93
+ [`TFGPT2DoubleHeadsModel`].
94
+
95
+ Whether or not to add a projection after the vector extraction.
96
+ summary_activation (`str`, *optional*):
97
+ Argument used when doing sequence summary. Used in for the multiple choice head in
98
+ [`GPT2DoubleHeadsModel`].
99
+
100
+ Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation.
101
+ summary_proj_to_labels (`bool`, *optional*, defaults to `True`):
102
+ Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and
103
+ [`TFGPT2DoubleHeadsModel`].
104
+
105
+ Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes.
106
+ summary_first_dropout (`float`, *optional*, defaults to 0.1):
107
+ Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and
108
+ [`TFGPT2DoubleHeadsModel`].
109
+
110
+ The dropout ratio to be used after the projection and activation.
111
+ scale_attn_weights (`bool`, *optional*, defaults to `True`):
112
+ Scale attention weights by dividing by sqrt(head_dim)..
113
+ use_cache (`bool`, *optional*, defaults to `True`):
114
+ Whether or not the model should return the last key/values attentions (not used by all models).
115
+ scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`):
116
+ Whether to additionally scale attention weights by `1 / layer_idx + 1`.
117
+ reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`):
118
+ Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention
119
+ dot-product/softmax to float() when training with mixed precision.
120
+
121
+ Example:
122
+
123
+ ```python
124
+ >>> from transformers import GPT2Config, GPT2Model
125
+
126
+ >>> # Initializing a GPT2 configuration
127
+ >>> configuration = GPT2Config()
128
+
129
+ >>> # Initializing a model (with random weights) from the configuration
130
+ >>> model = GPT2Model(configuration)
131
+
132
+ >>> # Accessing the model configuration
133
+ >>> configuration = model.config
134
+ ```"""
135
+
136
+ model_type = "gpt2"
137
+ keys_to_ignore_at_inference = ["past_key_values"]
138
+ attribute_map = {
139
+ "hidden_size": "n_embd",
140
+ "max_position_embeddings": "n_positions",
141
+ "num_attention_heads": "n_head",
142
+ "num_hidden_layers": "n_layer",
143
+ }
144
+
145
+ def __init__(
146
+ self,
147
+ vocab_size=50257,
148
+ n_positions=1024,
149
+ n_embd=768,
150
+ n_layer=12,
151
+ n_head=12,
152
+ n_inner=None,
153
+ activation_function="gelu_new",
154
+ resid_pdrop=0.1,
155
+ embd_pdrop=0.1,
156
+ attn_pdrop=0.1,
157
+ layer_norm_epsilon=1e-5,
158
+ initializer_range=0.02,
159
+ summary_type="cls_index",
160
+ summary_use_proj=True,
161
+ summary_activation=None,
162
+ summary_proj_to_labels=True,
163
+ summary_first_dropout=0.1,
164
+ scale_attn_weights=True,
165
+ use_cache=True,
166
+ bos_token_id=50256,
167
+ eos_token_id=50256,
168
+ scale_attn_by_inverse_layer_idx=False,
169
+ reorder_and_upcast_attn=False,
170
+ attention_head_type=MULTI_HEAD,
171
+ **kwargs,
172
+ ):
173
+ self.vocab_size = vocab_size
174
+ self.n_positions = n_positions
175
+ self.n_embd = n_embd
176
+ self.n_layer = n_layer
177
+ self.n_head = n_head
178
+ self.n_inner = n_inner
179
+ self.activation_function = activation_function
180
+ self.resid_pdrop = resid_pdrop
181
+ self.embd_pdrop = embd_pdrop
182
+ self.attn_pdrop = attn_pdrop
183
+ self.layer_norm_epsilon = layer_norm_epsilon
184
+ self.initializer_range = initializer_range
185
+ self.summary_type = summary_type
186
+ self.summary_use_proj = summary_use_proj
187
+ self.summary_activation = summary_activation
188
+ self.summary_first_dropout = summary_first_dropout
189
+ self.summary_proj_to_labels = summary_proj_to_labels
190
+ self.scale_attn_weights = scale_attn_weights
191
+ self.use_cache = use_cache
192
+ self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
193
+ self.reorder_and_upcast_attn = reorder_and_upcast_attn
194
+ self.attention_head_type = attention_head_type
195
+ # assert attention_head_type in [AttentionType.MULTI_HEAD, AttentionType.MULTI_QUERY]
196
+ assert attention_head_type in [MULTI_HEAD, MULTI_QUERY]
197
+
198
+ self.bos_token_id = bos_token_id
199
+ self.eos_token_id = eos_token_id
200
+
201
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
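
To make the multi-query trade-off concrete with this repo's numbers (n_embd=2048, n_head=16, so head_dim=128): keys and values are projected to a single shared head (see `kv_attn` in `modeling_gpt2_mq.py` below) instead of one per attention head, shrinking both the KV projection and the per-token KV cache by a factor of n_head. A rough illustrative arithmetic sketch of ours:

```python
# Illustrative arithmetic only; numbers taken from config.json above.
n_embd, n_head, n_layer = 2048, 16, 24
head_dim = n_embd // n_head  # 128

mha_kv_proj_params = n_embd * (2 * n_embd)      # standard multi-head K,V projections
mqa_kv_proj_params = n_embd * (2 * head_dim)    # single shared K,V head
print(mha_kv_proj_params // mqa_kv_proj_params)  # 16x fewer KV projection parameters

# Per-token KV cache entries across all layers (multiply by 2 bytes for fp16)
mha_cache = n_layer * 2 * n_head * head_dim  # 98,304 values per token
mqa_cache = n_layer * 2 * head_dim           # 6,144 values per token
print(mha_cache, mqa_cache)
```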
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba58d7bbc20355cd3083e789a88fa6b9016ec36ffaf113e94df03d1449ecadf6
size 4903283827
modeling_gpt2_mq.py ADDED
@@ -0,0 +1,498 @@
1
+ """PyTorch OpenAI GPT-2 model modified with MultiQuery attention"""
2
+
3
+ from typing import Optional, Tuple, Union
4
+
5
+ import math
6
+ import torch
7
+ import torch.utils.checkpoint
8
+ from torch import nn
9
+
10
+ from transformers.activations import ACT2FN
11
+ from transformers.modeling_outputs import (
12
+ BaseModelOutputWithPastAndCrossAttentions,
13
+ )
14
+ from transformers.models.gpt2.modeling_gpt2 import GPT2Model, GPT2Block, GPT2PreTrainedModel, GPT2LMHeadModel
15
+ from transformers.utils import logging
16
+ from configuration_gpt2_mq import GPT2CustomConfig, MULTI_QUERY
17
+
18
+ logger = logging.get_logger(__name__)
19
+
20
+
21
+ def make_causal_mask(
22
+ input_ids_shape: torch.Size, device: torch.device, past_key_values_length: int
23
+ ) -> torch.BoolTensor:
24
+ """
25
+ Make causal mask used for self-attention.
26
+ """
27
+ batch_size, target_length = input_ids_shape
28
+ mask = torch.empty((target_length, target_length + past_key_values_length), dtype=torch.bool, device=device)
29
+ # ONNX doesn't support `torch.Tensor.triu` properly, thus we use this workaround
30
+ seq_ids = torch.arange(target_length, device=device)
31
+ mask[:, past_key_values_length:] = seq_ids[:, None] < seq_ids[None, :]
32
+
33
+ if past_key_values_length > 0:
34
+ mask[:, :past_key_values_length] = False
35
+
36
+ expanded_mask = mask[None, :, :].expand(batch_size, target_length, target_length + past_key_values_length)
37
+ return expanded_mask
38
+
39
+
40
+ def expand_mask(mask: torch.Tensor, tgt_length: int) -> torch.BoolTensor:
41
+ """
42
+ Expands attention_mask from `[batch_size, src_length]` to `[batch_size, 1, tgt_length, src_length]`.
43
+ """
44
+ batch_size, src_length = mask.shape
45
+ tgt_length = tgt_length if tgt_length is not None else src_length
46
+
47
+ expanded_mask = ~(mask[:, None, :].to(torch.bool))
48
+ return expanded_mask.expand(batch_size, tgt_length, src_length)
49
+
50
+
51
+ def prepare_attn_mask(
52
+ attention_mask: torch.Tensor, input_shape: Tuple[int, int], past_key_values_length: int
53
+ ) -> torch.BoolTensor:
54
+ # create causal mask
55
+ # [batch_size, seq_length] -> [batch_size, tgt_length, src_length]
56
+ combined_attention_mask = None
57
+ device = attention_mask.device
58
+ _, src_length = input_shape
59
+
60
+ if src_length > 1:
61
+ combined_attention_mask = make_causal_mask(
62
+ input_shape, device=device, past_key_values_length=past_key_values_length
63
+ )
64
+
65
+ # [batch_size, seq_length] -> [batch_size, tgt_length, src_length]
66
+ expanded_attn_mask = expand_mask(attention_mask, tgt_length=src_length)
67
+ combined_attention_mask = (
68
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
69
+ )
70
+
71
+ return combined_attention_mask
72
+
73
+
74
+ @torch.jit.script
75
+ def gelu_forward(x: torch.Tensor) -> torch.Tensor:
76
+ """
77
+ Custom bias GELU function. Adapted from Megatron-DeepSpeed code. Here we use a simple implementation (inference) to
78
+ make the model jitable.
79
+
80
+ Args:
81
+ x (`torch.tensor`, *required*):
82
+ input hidden states
83
+ """
84
+ return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
85
+
86
+
87
+ class LinearGPT2MLP(nn.Module):
88
+ def __init__(self, intermediate_size, config):
89
+ super().__init__()
90
+ embed_dim = config.hidden_size
91
+ self.c_fc = nn.Linear(embed_dim, intermediate_size)
92
+ self.c_proj = nn.Linear(intermediate_size, embed_dim)
93
+ self.act = ACT2FN[config.activation_function] if "gelu" not in config.activation_function else gelu_forward
94
+ self.dropout = nn.Dropout(config.resid_pdrop)
95
+
96
+ def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:
97
+ hidden_states = self.c_fc(hidden_states)
98
+ hidden_states = self.act(hidden_states)
99
+ hidden_states = self.c_proj(hidden_states)
100
+ hidden_states = self.dropout(hidden_states)
101
+ return hidden_states
102
+
103
+
104
+ class GPT2MQAttention(nn.Module):
105
+ def __init__(self, config, is_cross_attention=False, layer_idx=None):
106
+ super().__init__()
107
+ assert config.attention_head_type == MULTI_QUERY
108
+
109
+ self.embed_dim = config.hidden_size
110
+ self.num_heads = config.num_attention_heads
111
+ self.head_dim = self.embed_dim // self.num_heads
112
+ self.split_size = self.embed_dim
113
+ if self.head_dim * self.num_heads != self.embed_dim:
114
+ raise ValueError(
115
+ f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
116
+ f" {self.num_heads})."
117
+ )
118
+
119
+ self.scale_attn_weights = config.scale_attn_weights
120
+ if is_cross_attention:
121
+ raise NotImplementedError("Cross-attention not implemented for MQA")
122
+ self.is_cross_attention = is_cross_attention
123
+
124
+ # Layer-wise attention scaling, reordering, and upcasting
125
+ self.scale_attn_by_inverse_layer_idx = config.scale_attn_by_inverse_layer_idx
126
+ self.layer_idx = layer_idx
127
+ self.reorder_and_upcast_attn = config.reorder_and_upcast_attn
128
+
129
+ if self.is_cross_attention:
130
+ raise NotImplementedError("Cross-attention not implemented for MQA")
131
+ else:
132
+ # self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)
133
+ self.q_attn = nn.Linear(self.embed_dim, self.embed_dim)
134
+ # Keys and values are shared across heads
135
+ self.kv_attn = nn.Linear(self.embed_dim, 2 * self.head_dim)
136
+ self.c_proj = nn.Linear(self.embed_dim, self.embed_dim)
137
+
138
+ self.attn_dropout = nn.Dropout(config.attn_pdrop)
139
+ self.resid_dropout = nn.Dropout(config.resid_pdrop)
140
+
141
+ self.pruned_heads = set()
142
+ self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
143
+
144
+ def _attn(self, query, key, value, attention_mask=None, head_mask=None):
145
+ # query: (b, num_heads * sq, head_dim)
146
+ # key: (b, head_dim, sk)
147
+ # value: (b, sk, head_dim)
148
+ batch_size = query.size(0)
149
+ query_length = query.size(1) // self.num_heads
150
+ key_length = key.size(2)
151
+ # (b, num_heads * sq, head_dim) x (b, head_dim, sk) -> (b, num_heads * sq, sk)
152
+
153
+ if self.scale_attn_weights:
154
+ query *= self.inv_norm_factor
155
+
156
+ attn_weights = torch.bmm(query, key)
157
+
158
+ # -> (b, num_heads, sq, sk)
159
+ attn_weights = attn_weights.view(batch_size, self.num_heads, query_length, key_length)
160
+
161
+ # Layer-wise attention scaling
162
+ if self.scale_attn_by_inverse_layer_idx:
163
+ attn_weights = attn_weights / float(self.layer_idx + 1)
164
+
165
+ if attention_mask is not None:
166
+ attn_weights = attn_weights.masked_fill_(attention_mask, torch.finfo(attn_weights.dtype).min)
167
+
168
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
169
+
170
+ # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
171
+ attn_weights = attn_weights.type(value.dtype)
172
+ attn_weights = self.attn_dropout(attn_weights)
173
+
174
+ # Mask heads if we want to
175
+ if head_mask is not None:
176
+ attn_weights = attn_weights * head_mask
177
+
178
+ # (b, num_heads, sq, sk) -> (b, num_heads * sq, sk)
179
+ _attn_weights = attn_weights.view(batch_size, self.num_heads * query_length, key_length)
180
+ # (b, num_heads * sq, sk) x (b, sk, head_dim) -> (b, num_heads * sq, head_dim)
181
+ attn_output = torch.bmm(_attn_weights, value)
182
+ attn_output = attn_output.view(batch_size, self.num_heads, query_length, self.head_dim)
183
+
184
+ return attn_output, attn_weights
185
+
186
+ def _merge_heads(self, tensor):
187
+ """
188
+ Merges attn_head_size dim and num_attn_heads dim into hidden_size
189
+ """
190
+ batch_size, num_heads, seq_length, head_dim = tensor.shape
191
+
192
+ tensor = tensor.permute(0, 2, 1, 3)
193
+ return tensor.reshape(batch_size, seq_length, num_heads * head_dim)
194
+
195
+ def forward(
196
+ self,
197
+ hidden_states: Optional[Tuple[torch.FloatTensor]],
198
+ layer_past: Optional[Tuple[torch.Tensor]] = None,
199
+ attention_mask: Optional[torch.FloatTensor] = None,
200
+ head_mask: Optional[torch.FloatTensor] = None,
201
+ encoder_hidden_states: Optional[torch.Tensor] = None,
202
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
203
+ use_cache: Optional[bool] = False,
204
+ output_attentions: Optional[bool] = False,
205
+ ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]], ...]:
206
+ if encoder_hidden_states is not None:
207
+ raise NotImplementedError("Cross-attention not implemented for MQA")
208
+ else:
209
+ query = self.q_attn(hidden_states)
210
+ key, value = self.kv_attn(hidden_states).split(self.head_dim, dim=2)
211
+
212
+ batch_size, seq_length = query.shape[:2]
213
+ # (query_length, batch, num_heads, head_dim)
214
+ # (batch, num_heads * query_length, head_dim)\
215
+
216
+ # (batch, query_length, hidden_size) -> (batch, num_heads, query_length, head_dim)
217
+ query = query.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
218
+ # -> (batch, num_heads * query_length, head_dim)
219
+ query = query.reshape(batch_size, self.num_heads * seq_length, self.head_dim)
220
+
221
+ key = key.transpose(1, 2) # (batch_size, head_dim, seq_length)
222
+
223
+ if layer_past is not None:
224
+ past_key, past_value = layer_past
225
+ # Concatenate on sequence dimension
226
+ key = torch.cat((past_key, key), dim=-1)
227
+ value = torch.cat((past_value, value), dim=-2)
228
+
229
+ if use_cache is True:
230
+ present = (key, value)
231
+ else:
232
+ present = None
233
+
234
+ if self.reorder_and_upcast_attn:
235
+ raise NotImplementedError("Reorder and upcast attention not implemented for MQA")
236
+ else:
237
+ attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
238
+
239
+ attn_output = self._merge_heads(attn_output)
240
+ attn_output = self.c_proj(attn_output)
241
+ attn_output = self.resid_dropout(attn_output)
242
+
243
+ outputs = (attn_output, present)
244
+ if output_attentions:
245
+ outputs += (attn_weights,)
246
+
247
+ return outputs # a, present, (attentions)
248
+
249
+
250
+ # inherit from gpt_modeling.py, and override `attn` module
251
+ class GPT2CustomBlock(GPT2Block):
252
+
253
+ def __init__(self, config: GPT2CustomConfig, layer_idx=None):
254
+ super().__init__(config, layer_idx)
255
+ # Override attention module if using multiquery
256
+ if config.attention_head_type == MULTI_QUERY:
257
+ self.attn = GPT2MQAttention(config, layer_idx=layer_idx)
258
+ if config.add_cross_attention:
259
+ raise NotImplementedError("Cross-attention not implemented for MQA")
260
+
261
+ hidden_size = config.hidden_size
262
+ inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size
263
+ self.mlp = LinearGPT2MLP(inner_dim, config)
264
+
265
+
266
+ # inherit from gpt_modeling.py and override `__init__` and `forward` methods
267
+ class GPT2CustomModel(GPT2Model):
268
+ config_class = GPT2CustomConfig
269
+
270
+ def __init__(self, config):
271
+ GPT2PreTrainedModel.__init__(self, config)
272
+
273
+ if config.attention_head_type != MULTI_QUERY:
274
+ raise NotImplementedError("optimized gpt2 is not implemented for MHA")
275
+
276
+ self.embed_dim = config.hidden_size
277
+
278
+ self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
279
+ self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
280
+
281
+ self.drop = nn.Dropout(config.embd_pdrop)
282
+ self.h = nn.ModuleList([GPT2CustomBlock(config, layer_idx=i) for i in range(config.num_hidden_layers)])
283
+ self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
284
+
285
+ # Model parallel
286
+ self.model_parallel = False
287
+ self.device_map = None
288
+ self.gradient_checkpointing = False
289
+
290
+ # Initialize weights and apply final processing
291
+ self.post_init()
292
+
293
+ def forward(
294
+ self,
295
+ input_ids: Optional[torch.LongTensor] = None,
296
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
297
+ attention_mask: Optional[torch.FloatTensor] = None,
298
+ token_type_ids: Optional[torch.LongTensor] = None,
299
+ position_ids: Optional[torch.LongTensor] = None,
300
+ head_mask: Optional[torch.FloatTensor] = None,
301
+ inputs_embeds: Optional[torch.FloatTensor] = None,
302
+ encoder_hidden_states: Optional[torch.Tensor] = None,
303
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
304
+ use_cache: Optional[bool] = None,
305
+ output_attentions: Optional[bool] = None,
306
+ output_hidden_states: Optional[bool] = None,
307
+ return_dict: Optional[bool] = None,
308
+ ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
309
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
310
+ output_hidden_states = (
311
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
312
+ )
313
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
314
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
315
+
316
+ if input_ids is not None and inputs_embeds is not None:
317
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
318
+ elif input_ids is not None:
319
+ input_shape = input_ids.size()
320
+ input_ids = input_ids.view(-1, input_shape[-1])
321
+ batch_size = input_ids.shape[0]
322
+ seq_length = input_ids.shape[1]
323
+ elif inputs_embeds is not None:
324
+ input_shape = inputs_embeds.size()[:-1]
325
+ batch_size = inputs_embeds.shape[0]
326
+ seq_length = inputs_embeds.shape[1]
327
+ else:
328
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
329
+
330
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
331
+
332
+ if token_type_ids is not None:
333
+ token_type_ids = token_type_ids.view(-1, input_shape[-1])
334
+ if position_ids is not None:
335
+ position_ids = position_ids.view(-1, input_shape[-1])
336
+
337
+ if past_key_values is None:
338
+ past_key_values = tuple([None] * len(self.h))
339
+
340
+ seq_length_with_past = seq_length
341
+ past_key_values_length = 0
342
+ if past_key_values[0] is not None:
343
+ past_key_values_length = past_key_values[0][0].shape[-1]
344
+ seq_length_with_past = seq_length_with_past + past_key_values_length
345
+ if position_ids is None:
346
+ position_ids = torch.arange(past_key_values_length, input_shape[-1] + past_key_values_length,
347
+ dtype=torch.long, device=device)
348
+ position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
349
+
350
+ # GPT2Attention mask.
351
+ if attention_mask is None:
352
+ attention_mask = torch.ones((batch_size, seq_length_with_past), device=input_ids.device)
353
+ else:
354
+ attention_mask = attention_mask.to(input_ids.device)
355
+
356
+ attention_mask = prepare_attn_mask(
357
+ attention_mask,
358
+ input_shape=(batch_size, seq_length),
359
+ past_key_values_length=past_key_values_length,
360
+ )
361
+
362
+ attention_mask = attention_mask.unsqueeze(1).expand(batch_size, self.config.num_attention_heads,
363
+ *attention_mask.shape[1:])
364
+
365
+ # If a 2D or 3D attention mask is provided for the cross-attention
366
+ # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
367
+ if self.config.add_cross_attention and encoder_hidden_states is not None:
368
+ raise NotImplementedError
369
+ else:
370
+ encoder_attention_mask = None
371
+
372
+ # Prepare head mask if needed
373
+ # 1.0 in head_mask indicate we keep the head
374
+ # attention_probs has shape bsz x n_heads x N x N
375
+ # head_mask has shape n_layer x batch x n_heads x N x N
376
+ head_mask = self.get_head_mask(head_mask, self.config.n_layer)
377
+
378
+ if inputs_embeds is None:
379
+ inputs_embeds = self.wte(input_ids)
380
+ position_embeds = self.wpe(position_ids)
381
+ hidden_states = inputs_embeds + position_embeds
382
+
383
+ if token_type_ids is not None:
384
+ token_type_embeds = self.wte(token_type_ids)
385
+ hidden_states = hidden_states + token_type_embeds
386
+
387
+ hidden_states = self.drop(hidden_states)
388
+
389
+ output_shape = input_shape + (hidden_states.size(-1),)
390
+
391
+ presents = () if use_cache else None
392
+ all_self_attentions = () if output_attentions else None
393
+ all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
394
+ all_hidden_states = () if output_hidden_states else None
395
+ for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
396
+
397
+ # Model parallel
398
+ if self.model_parallel:
399
+ torch.cuda.set_device(hidden_states.device)
400
+ # Ensure layer_past is on same device as hidden_states (might not be correct)
401
+ if layer_past is not None:
402
+ layer_past = tuple(past_state.to(hidden_states.device) for past_state in layer_past)
403
+ # Ensure that attention_mask is always on the same device as hidden_states
404
+ if attention_mask is not None:
405
+ attention_mask = attention_mask.to(hidden_states.device)
406
+ if isinstance(head_mask, torch.Tensor):
407
+ head_mask = head_mask.to(hidden_states.device)
408
+ if output_hidden_states:
409
+ all_hidden_states = all_hidden_states + (hidden_states,)
410
+
411
+ if self.gradient_checkpointing and self.training:
412
+
413
+ if use_cache:
414
+ logger.warning(
415
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
416
+ )
417
+ use_cache = False
418
+
419
+ def create_custom_forward(module):
420
+ def custom_forward(*inputs):
421
+ # None for past_key_value
422
+ return module(*inputs, use_cache, output_attentions)
423
+
424
+ return custom_forward
425
+
426
+ outputs = torch.utils.checkpoint.checkpoint(
427
+ create_custom_forward(block),
428
+ hidden_states,
429
+ None,
430
+ attention_mask,
431
+ head_mask[i],
432
+ encoder_hidden_states,
433
+ encoder_attention_mask,
434
+ )
435
+ else:
436
+ outputs = block(
437
+ hidden_states,
438
+ layer_past=layer_past,
439
+ attention_mask=attention_mask,
440
+ head_mask=head_mask[i],
441
+ encoder_hidden_states=encoder_hidden_states,
442
+ encoder_attention_mask=encoder_attention_mask,
443
+ use_cache=use_cache,
444
+ output_attentions=output_attentions,
445
+ )
446
+
447
+ hidden_states = outputs[0]
448
+ if use_cache is True:
449
+ presents = presents + (outputs[1],)
450
+
451
+ if output_attentions:
452
+ all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
453
+ if self.config.add_cross_attention:
454
+ all_cross_attentions = all_cross_attentions + (outputs[3 if use_cache else 2],)
455
+
456
+ # Model Parallel: If it's the last layer for that device, put things on the next device
457
+ if self.model_parallel:
458
+ for k, v in self.device_map.items():
459
+ if i == v[-1] and "cuda:" + str(k) != self.last_device:
460
+ hidden_states = hidden_states.to("cuda:" + str(k + 1))
461
+
462
+ hidden_states = self.ln_f(hidden_states)
463
+
464
+ hidden_states = hidden_states.view(output_shape)
465
+ # Add last hidden state
466
+ if output_hidden_states:
467
+ all_hidden_states = all_hidden_states + (hidden_states,)
468
+
469
+ if not return_dict:
470
+ return tuple(
471
+ v
472
+ for v in [hidden_states, presents, all_hidden_states, all_self_attentions, all_cross_attentions]
473
+ if v is not None
474
+ )
475
+
476
+ return BaseModelOutputWithPastAndCrossAttentions(
477
+ last_hidden_state=hidden_states,
478
+ past_key_values=presents,
479
+ hidden_states=all_hidden_states,
480
+ attentions=all_self_attentions,
481
+ cross_attentions=all_cross_attentions,
482
+ )
483
+
484
+
485
+ class GPT2LMHeadCustomModel(GPT2LMHeadModel):
486
+ config_class = GPT2CustomConfig
487
+
488
+ def __init__(self, config):
489
+ GPT2PreTrainedModel.__init__(self, config)
490
+ self.transformer = GPT2CustomModel(config)
491
+ self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
492
+
493
+ # Model parallel
494
+ self.model_parallel = False
495
+ self.device_map = None
496
+
497
+ # Initialize weights and apply final processing
498
+ self.post_init()
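
The `_attn` method above folds all query heads into the batched-matmul "sequence" dimension so that a single `bmm` attends every head against the one shared key/value head. A self-contained sketch of that shape trick with illustrative sizes (not tied to the actual checkpoint, and omitting the causal mask):

```python
import torch

batch, num_heads, seq, head_dim = 2, 16, 8, 128
query = torch.randn(batch, num_heads, seq, head_dim)
key = torch.randn(batch, seq, head_dim)    # one shared key head
value = torch.randn(batch, seq, head_dim)  # one shared value head

# (b, h*sq, d) x (b, d, sk) -> (b, h*sq, sk): all heads share the same key
q = query.reshape(batch, num_heads * seq, head_dim)
scores = torch.bmm(q / head_dim**0.5, key.transpose(1, 2))
probs = torch.softmax(scores.view(batch, num_heads, seq, seq), dim=-1)

# (b, h*sq, sk) x (b, sk, d) -> (b, h*sq, d): weighted sum over the shared values
out = torch.bmm(probs.view(batch, num_heads * seq, seq), value)
out = out.view(batch, num_heads, seq, head_dim)
print(out.shape)  # torch.Size([2, 16, 8, 128])
```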
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,7 @@
{
  "name_or_path": "bigcode/digit-bytelevel-bpe-jss-v1.1-49152",
  "special_tokens_map_file": "/Users/leandro/.cache/huggingface/hub/models--bigcode--digit-bytelevel-bpe-jss-v1.1-49152/snapshots/fa09b77949689a484afafc5f89534e6b6ba2c151/special_tokens_map.json",
  "tokenizer_class": "PreTrainedTokenizerFast",
  "vocab_size": 49152,
  "model_max_length": 2048
}