---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---
# crammed BERT Portuguese
This is a Portuguese BERT model trained for 24 hours on a single NVIDIA A6000 GPU, following the architecture described in "Cramming: Training a Language Model on a Single GPU in One Day".
To use this model, clone the code from my fork (https://github.com/wilsonjr/cramming) and `import cramming` before loading the checkpoint with the 🤗 Transformers `AutoModel` classes (see below).
## How to use
```python
import cramming  # needed so 🤗 Transformers can resolve the crammed BERT architecture
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

text = "Oi, eu sou um modelo <mask>."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
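Continuing from the snippet above, the top prediction for the masked position can be decoded as follows. This is a minimal sketch that assumes the standard 🤗 masked-LM output format (`output.logits`) and the tokenizer's registered mask token:
```python
# Find the masked position and decode the highest-scoring token
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = output.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```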
## Training Details
### Training Data & Config
- 30M entries from `TucanoBR/GigaVerbo` (see the loading sketch after this list)
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
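As a rough reference for the data setup above, the corpus can be streamed and cut into 128-token sequences along these lines. This is a sketch only: the `text` column name and the preprocessing are assumptions, not the exact cramming data pipeline:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

def tokenize(batch):
    # Truncate to the 128-token sequence length used during training
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True)
```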
### Training Procedure
- **optim** (see the sketch after this list):
- type: AdamW
- lr: 0.001
- betas:
- 0.9
- 0.98
- eps: 1.0e-12
- weight_decay: 0.01
- amsgrad: false
- fused: null
- warmup_steps: 0
- cooldown_steps: 0
- steps: 900000
- batch_size: 8192
- gradient_clipping: 0.5
- **objective**:
- name: masked-lm
- mlm_probability: 0.25
- token_drop: 0.0
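The optimizer and masked-LM objective listed above map roughly onto the following PyTorch / 🤗 setup. This is a sketch with assumed variable names (`model` and `tokenizer` from the "How to use" section), not the actual cramming training loop:
```python
import torch
from transformers import DataCollatorForLanguageModeling

# Optimizer configured with the values listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# Masked-LM objective with a 25% masking probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.25)

# During training, gradients are clipped each step at the configured threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```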
#### Training Hyperparameters
- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy
- **embedding**:
- vocab_size: null
- pos_embedding: scaled-sinusoidal
- dropout_prob: 0.1
- pad_token_id: 0
- max_seq_length: 128
- embedding_dim: 768
- normalization: true
- stable_low_precision: false
- **attention**:
- type: self-attention
- causal_attention: false
- num_attention_heads: 12
- dropout_prob: 0.1
- skip_output_projection: false
- qkv_bias: false
- rotary_embedding: false
- seq_op_in_fp32: false
- sequence_op: torch-softmax
- **init**:
- type: normal
- std: 0.02
- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false
- **classification_head**:
- pooler: avg
- include_ff_layer: true
- head_dim: 1024
- nonlin: Tanh
- classifier_dropout: 0.1
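Most of these architecture settings are stored with the checkpoint and can be inspected directly. The sketch below assumes the cramming fork is importable so the custom configuration class resolves:
```python
import cramming  # registers the crammed configuration/model classes
from transformers import AutoConfig

config = AutoConfig.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
print(config)  # prints the architecture settings saved with the checkpoint
```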
#### Speeds, Sizes, Times
- ~0.1674 s per step (≈97,886 tokens/s)
## Evaluation
TBD