---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---

# crammed BERT Portuguese


This is a BERT model trained for 24 hours on a single A6000 GPU. It follows the architecture described in "Cramming: Training a Language Model on a Single GPU in One Day".

To use this model, clone the code from my fork https://github.com/wilsonjr/cramming and `import cramming` before calling the 🤗 Transformers `Auto*` classes; importing `cramming` registers the crammed architecture so that `AutoModelForMaskedLM` can resolve it (see below).


## How to use


```python
import cramming  # registers the crammed architecture with 🤗 Transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

# Use the tokenizer's own mask token so the example matches the model's vocabulary.
text = f"Oi, eu sou um modelo {tokenizer.mask_token}."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
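
To inspect the predictions for the masked position, you can read the logits returned above with plain PyTorch; nothing cramming-specific is needed. A minimal sketch:

```python
import torch

# Find the masked position and list the top-5 candidate tokens for it.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_logits = output.logits[0, mask_positions[0]]  # shape: (vocab_size,)
top_ids = torch.topk(mask_logits, k=5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```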

## Training Details

### Training Data & Config

Corpus and tokenizer configuration (an illustrative preprocessing sketch follows the list):

- 30M entries from `TucanoBR/GigaVerbo`
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
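
The sketch below shows roughly what this configuration corresponds to, using the 🤗 `datasets` and `tokenizers` libraries. It is an illustration of the settings above, not the preprocessing code from the cramming repository; in particular, the `text` column name, the special-token inventory, and the `pack` helper are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stream raw text from the corpus (assumes a `text` column).
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

def text_batches(batch_size=1000):
    batch = []
    for row in corpus:
        batch.append(row["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# WordPiece tokenizer with the vocabulary size listed above (special tokens assumed).
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=32768,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.train_from_iterator(text_batches(), trainer)

def pack(ids_per_doc, sep_id, seq_length=128):
    """Concatenate documents with a separator (no CLS) and cut into fixed-length sequences."""
    stream = []
    for ids in ids_per_doc:
        stream.extend(ids + [sep_id])
    return [stream[i : i + seq_length] for i in range(0, len(stream) - seq_length + 1, seq_length)]
```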

### Training Procedure


The optimizer and masked-LM objective were configured as follows; a minimal PyTorch sketch of an equivalent setup appears after the list.

- **optim**:
  - type: AdamW
  - lr: 0.001
  - betas: [0.9, 0.98]
  - eps: 1.0e-12
  - weight_decay: 0.01
  - amsgrad: false
  - fused: null
  - warmup_steps: 0
  - cooldown_steps: 0
  - steps: 900000
  - batch_size: 8192
  - gradient_clipping: 0.5

- **objective**:
  - name: masked-lm
  - mlm_probability: 0.25
  - token_drop: 0.0
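
For reference, these settings map onto standard PyTorch and 🤗 Transformers components roughly as below, assuming `model` and `tokenizer` are loaded as in the "How to use" section. This is an illustrative sketch, not the training loop from the cramming repository; on a single GPU the effective batch size of 8192 would typically be reached through gradient accumulation, which is omitted here.

```python
import torch
from transformers import DataCollatorForLanguageModeling

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# Masked-LM batching: 25% of tokens are selected for the masking objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.25)

# Inside the training loop, gradients are clipped before each optimizer step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```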


#### Training Hyperparameters

Model architecture configuration (an illustrative sketch of the gated GELU feed-forward block follows the list):

- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy

- **embedding**:
  - vocab_size: null
  - pos_embedding: scaled-sinusoidal
  - dropout_prob: 0.1
  - pad_token_id: 0
  - max_seq_length: 128
  - embedding_dim: 768
  - normalization: true
  - stable_low_precision: false

- **attention**:
  - type: self-attention
  - causal_attention: false
  - num_attention_heads: 12
  - dropout_prob: 0.1
  - skip_output_projection: false
  - qkv_bias: false
  - rotary_embedding: false
  - seq_op_in_fp32: false
  - sequence_op: torch-softmax

- **init**:
  - type: normal
  - std: 0.02

- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false

- **classification_head**:
  - pooler: avg
  - include_ff_layer: true
  - head_dim: 1024
  - nonlin: Tanh
  - classifier_dropout: 0.1
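
The `nonlin: GELUglu` entry denotes a gated GELU ("GEGLU"-style) feed-forward block, one of the GLU variants described in Shazeer (2020), "GLU Variants Improve Transformer". The module below is a common way to implement the idea and is only an illustration; the exact layer layout in the cramming code base may differ.

```python
import torch.nn as nn

class GatedGELUFFN(nn.Module):
    """Illustrative gated GELU feed-forward block (GEGLU)."""

    def __init__(self, hidden_size=768, intermed_size=3072, dropout=0.1, use_bias=False):
        super().__init__()
        # One projection produces two halves: a GELU-activated gate and a linear value.
        self.dense_in = nn.Linear(hidden_size, 2 * intermed_size, bias=use_bias)
        self.dense_out = nn.Linear(intermed_size, hidden_size, bias=use_bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        gate, value = self.dense_in(x).chunk(2, dim=-1)
        return self.dropout(self.dense_out(nn.functional.gelu(gate) * value))
```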

#### Speeds, Sizes, Times 

- ~0.1674 s per step (~97,886 tokens/s)


## Evaluation


TBD