# crammed BERT Portuguese

This is a BERT model trained for 24 hours on a single A6000 GPU. It follows the architecture described in "Cramming: Training a Language Model on a Single GPU in One Day" (Geiping & Goldstein, 2022).

To use this model, clone the code from my fork at https://github.com/wilsonjr/cramming and import `cramming` before using the 🤗 transformers `AutoModel` classes (see below); the import makes the crammed architecture available to transformers.

## How to use


```python
import cramming  # must be imported first so the crammed architecture can be loaded
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

text = "Oi, eu sou um modelo <mask>."  # "Hi, I am a <mask> model."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
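To turn the raw output into a prediction, here is a minimal continuation of the snippet above. It assumes the model exposes standard 🤗 `MaskedLMOutput`-style `logits`; if the crammed model returns a plain dict instead, use `output["logits"]`.

```python
# Find the masked position(s) in the input.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Take the highest-scoring vocabulary entry at each masked position.
predicted_ids = output.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```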

## Training Details

### Training Data & Config

- 30M entries from TucanoBR/GigaVerbo
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
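For orientation, here is a minimal sketch of streaming the corpus and cutting it into length-128 sequences with 🤗 `datasets`. This is an illustration only, not the fork's actual preprocessing (which concatenates documents, keeping [SEP] but not [CLS], before chunking), and it assumes the dataset exposes a `text` column.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream to avoid materializing all 30M entries at once.
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

def to_sequences(batch):
    # Fixed-length 128-token sequences, matching seq_length above.
    return tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")

sequences = corpus.map(to_sequences, batched=True)
```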

### Training Procedure

- optim:
  - type: AdamW
  - lr: 0.001
  - betas: [0.9, 0.98]
  - eps: 1.0e-12
  - weight_decay: 0.01
  - amsgrad: false
  - fused: null
  - warmup_steps: 0
  - cooldown_steps: 0
  - steps: 900000
  - batch_size: 8192
  - gradient_clipping: 0.5
- objective:
  - name: masked-lm
  - mlm_probability: 0.25
  - token_drop: 0.0
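Read as code, the optimizer and objective settings above correspond to something like the following PyTorch sketch. This is illustrative only, not the fork's actual training loop; `model`, `tokenizer`, and `batch` are assumed to already exist.

```python
import torch
from transformers import DataCollatorForLanguageModeling

# AdamW exactly as configured above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# Masked-LM objective with 25% of tokens masked (token_drop disabled);
# the collator is applied when building `batch`.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.25)

# A single step with gradient clipping at 0.5.
loss = model(**batch).loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
optimizer.step()
optimizer.zero_grad()
```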

### Training Hyperparameters

- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy
- embedding:
  - vocab_size: null
  - pos_embedding: scaled-sinusoidal
  - dropout_prob: 0.1
  - pad_token_id: 0
  - max_seq_length: 128
  - embedding_dim: 768
  - normalization: true
  - stable_low_precision: false
- attention:
  - type: self-attention
  - causal_attention: false
  - num_attention_heads: 12
  - dropout_prob: 0.1
  - skip_output_projection: false
  - qkv_bias: false
  - rotary_embedding: false
  - seq_op_in_fp32: false
  - sequence_op: torch-softmax
- init:
  - type: normal
  - std: 0.02
- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false
- classification_head:
  - pooler: avg
  - include_ff_layer: true
  - head_dim: 1024
  - nonlin: Tanh
  - classifier_dropout: 0.1
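For concreteness, the `classification_head` settings map onto a module like the sketch below. This is hypothetical; `num_labels` is not part of the config above, and the fork's actual head may differ.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average-pool the token states, then a feed-forward layer into the classifier."""

    def __init__(self, hidden_size=768, head_dim=1024, num_labels=2, dropout=0.1):
        super().__init__()
        self.ff = nn.Linear(hidden_size, head_dim)  # include_ff_layer: true, head_dim: 1024
        self.nonlin = nn.Tanh()                     # nonlin: Tanh
        self.dropout = nn.Dropout(dropout)          # classifier_dropout: 0.1
        self.classifier = nn.Linear(head_dim, num_labels)

    def forward(self, hidden_states):               # (batch, seq_len, hidden_size)
        pooled = hidden_states.mean(dim=1)          # pooler: avg
        return self.classifier(self.dropout(self.nonlin(self.ff(pooled))))
```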

### Speeds, Sizes, Times

- ~0.1674 s per step (≈97,886 tokens/s)
- 145M parameters, float32 tensors, stored in safetensors format

## Evaluation

TBD
