# crammed BERT Portuguese
This is a BERT model trained for 24 hours on a single A6000 GPU, following the architecture described in ["Cramming: Training a Language Model on a Single GPU in One Day"](https://arxiv.org/abs/2212.14034).

To use this model, clone the code from my fork at https://github.com/wilsonjr/cramming and `import cramming` before loading the model with the 🤗 Transformers `AutoModel` classes (see below).
## How to use
```python
import cramming  # registers the crammed-bert architecture with 🤗 transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

text = "Oi, eu sou um modelo <mask>."  # "Hi, I am a <mask> model."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
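To see what the model predicts for the masked position, decode the highest-scoring token at the mask index. This is a minimal sketch, assuming the forward pass exposes standard masked-LM `logits` and that the tokenizer defines `mask_token_id` for the `<mask>` string used above:

```python
import torch

# Find the mask position(s) and take the argmax over the vocabulary.
mask_index = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = output.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```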
## Training Details

### Training Data & Config
- 30M entries from [TucanoBR/GigaVerbo](https://huggingface.co/datasets/TucanoBR/GigaVerbo)
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
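For reference, the tokenizer settings above roughly correspond to the following standalone recipe with the 🤗 `tokenizers` library. This is a hedged sketch, not the actual cramming preprocessing code; the normalization steps and the `corpus.txt` file name are assumptions:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with a 32768-token vocabulary, as in the config above.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]  # assumed defaults
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=32768,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
# corpus.txt stands in for a text dump of the GigaVerbo entries.
tokenizer.train(["corpus.txt"], trainer)
```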
### Training Procedure
```yaml
optim:
  type: AdamW
  lr: 0.001
  betas:
    - 0.9
    - 0.98
  eps: 1.0e-12
  weight_decay: 0.01
  amsgrad: false
  fused: null
warmup_steps: 0
cooldown_steps: 0
steps: 900000
batch_size: 8192
gradient_clipping: 0.5
objective:
  name: masked-lm
  mlm_probability: 0.25
  token_drop: 0.0
```
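As a rough illustration, these optimizer and objective settings map onto standard PyTorch / 🤗 Transformers components as follows. This is a sketch for readers re-training with vanilla tooling, not the cramming trainer itself:

```python
import torch
from transformers import DataCollatorForLanguageModeling

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# Masked-LM objective: 25% of tokens are masked (vs. BERT's default 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.25
)

# Inside the training loop, gradients are clipped at norm 0.5 before each step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```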
### Training Hyperparameters
```yaml
num_transformer_layers: 16
hidden_size: 768
intermed_size: 3072
hidden_dropout_prob: 0.1
norm: LayerNorm
norm_eps: 1.0e-12
norm_scheme: pre
nonlin: GELUglu
tie_weights: true
decoder_bias: false
sparse_prediction: 0.25
loss: cross-entropy
embedding:
  vocab_size: null
  pos_embedding: scaled-sinusoidal
  dropout_prob: 0.1
  pad_token_id: 0
  max_seq_length: 128
  embedding_dim: 768
  normalization: true
  stable_low_precision: false
attention:
  type: self-attention
  causal_attention: false
  num_attention_heads: 12
  dropout_prob: 0.1
  skip_output_projection: false
  qkv_bias: false
  rotary_embedding: false
  seq_op_in_fp32: false
  sequence_op: torch-softmax
init:
  type: normal
  std: 0.02
ffn_layer_frequency: 1
skip_head_transform: true
use_bias: false
classification_head:
  pooler: avg
  include_ff_layer: true
  head_dim: 1024
  nonlin: Tanh
  classifier_dropout: 0.1
```
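The `nonlin: GELUglu` entry is a gated-linear-unit feed-forward variant in the style of "GLU Variants Improve Transformer" (Shazeer, 2020): the FFN projects to twice the intermediate size and one half gates the other through a GELU. A minimal sketch of such a block (layer names and exact placement are assumptions, not the cramming source):

```python
import torch
import torch.nn as nn

class GEGLUFeedForward(nn.Module):
    """FFN block with a GELU-gated linear unit (GELUglu)."""

    def __init__(self, hidden_size: int = 768, intermed_size: int = 3072, dropout: float = 0.1):
        super().__init__()
        # Project to twice the intermediate size: half is the value, half the gate.
        # bias=False mirrors use_bias: false in the config above.
        self.dense_in = nn.Linear(hidden_size, 2 * intermed_size, bias=False)
        self.dense_out = nn.Linear(intermed_size, hidden_size, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.dense_in(x).chunk(2, dim=-1)
        return self.dropout(self.dense_out(value * torch.nn.functional.gelu(gate)))
```

Relatedly, `sparse_prediction: 0.25` refers to the cramming paper's sparse token prediction: the LM head and loss are evaluated only on the masked positions (here 25% of tokens, matching `mlm_probability`), skipping most of the decoder compute.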
### Speeds, Sizes, Times

- ~0.1674 s per step (97,886 tokens/s)
## Evaluation
TBD