---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---

# crammed BERT Portuguese


This is a BERT model trained for 24 hours on a single A6000 GPU. It follows the architecture described in "Cramming: Training a Language Model on a Single GPU in One Day".

To use this model, clone the code from my fork https://github.com/wilsonjr/cramming and `import cramming` before calling the 🤗 Transformers `Auto*` classes; importing `cramming` registers the crammed architecture so that `AutoModelForMaskedLM` can resolve it (see below).


## How to use


```python
import cramming  # registers the crammed architecture with 🤗 Transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

# Use the tokenizer's own mask token so the example matches the model's vocabulary.
text = f"Oi, eu sou um modelo {tokenizer.mask_token}."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
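
To inspect the predictions for the masked position, you can read the logits returned above with plain PyTorch; nothing cramming-specific is needed. A minimal sketch:

```python
import torch

# Find the masked position and list the top-5 candidate tokens for it.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_logits = output.logits[0, mask_positions[0]]  # shape: (vocab_size,)
top_ids = torch.topk(mask_logits, k=5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```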

## Training Details

### Training Data & Config

Corpus and tokenizer configuration (an illustrative preprocessing sketch follows the list):

- 30M entries from `TucanoBR/GigaVerbo`
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
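
The sketch below shows roughly what this configuration corresponds to, using the 🤗 `datasets` and `tokenizers` libraries. It is an illustration of the settings above, not the preprocessing code from the cramming repository; in particular, the `text` column name, the special-token inventory, and the `pack` helper are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stream raw text from the corpus (assumes a `text` column).
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

def text_batches(batch_size=1000):
    batch = []
    for row in corpus:
        batch.append(row["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# WordPiece tokenizer with the vocabulary size listed above (special tokens assumed).
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=32768,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.train_from_iterator(text_batches(), trainer)

def pack(ids_per_doc, sep_id, seq_length=128):
    """Concatenate documents with a separator (no CLS) and cut into fixed-length sequences."""
    stream = []
    for ids in ids_per_doc:
        stream.extend(ids + [sep_id])
    return [stream[i : i + seq_length] for i in range(0, len(stream) - seq_length + 1, seq_length)]
```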

### Training Procedure


The optimizer and masked-LM objective were configured as follows; a minimal PyTorch sketch of an equivalent setup appears after the list.

- **optim**:
  - type: AdamW
  - lr: 0.001
  - betas: [0.9, 0.98]
  - eps: 1.0e-12
  - weight_decay: 0.01
  - amsgrad: false
  - fused: null
  - warmup_steps: 0
  - cooldown_steps: 0
  - steps: 900000
  - batch_size: 8192
  - gradient_clipping: 0.5

- **objective**:
  - name: masked-lm
  - mlm_probability: 0.25
  - token_drop: 0.0
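
For reference, these settings map onto standard PyTorch and 🤗 Transformers components roughly as below, assuming `model` and `tokenizer` are loaded as in the "How to use" section. This is an illustrative sketch, not the training loop from the cramming repository; on a single GPU the effective batch size of 8192 would typically be reached through gradient accumulation, which is omitted here.

```python
import torch
from transformers import DataCollatorForLanguageModeling

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# Masked-LM batching: 25% of tokens are selected for the masking objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.25)

# Inside the training loop, gradients are clipped before each optimizer step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```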


#### Training Hyperparameters

Model architecture configuration (an illustrative sketch of the gated GELU feed-forward block follows the list):

- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy

- **embedding**:
  - vocab_size: null
  - pos_embedding: scaled-sinusoidal
  - dropout_prob: 0.1
  - pad_token_id: 0
  - max_seq_length: 128
  - embedding_dim: 768
  - normalization: true
  - stable_low_precision: false

- **attention**:
  - type: self-attention
  - causal_attention: false
  - num_attention_heads: 12
  - dropout_prob: 0.1
  - skip_output_projection: false
  - qkv_bias: false
  - rotary_embedding: false
  - seq_op_in_fp32: false
  - sequence_op: torch-softmax

- **init**:
  - type: normal
  - std: 0.02

- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false

- **classification_head**:
  - pooler: avg
  - include_ff_layer: true
  - head_dim: 1024
  - nonlin: Tanh
  - classifier_dropout: 0.1
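
The `nonlin: GELUglu` entry denotes a gated GELU ("GEGLU"-style) feed-forward block, one of the GLU variants described in Shazeer (2020), "GLU Variants Improve Transformer". The module below is a common way to implement the idea and is only an illustration; the exact layer layout in the cramming code base may differ.

```python
import torch.nn as nn

class GatedGELUFFN(nn.Module):
    """Illustrative gated GELU feed-forward block (GEGLU)."""

    def __init__(self, hidden_size=768, intermed_size=3072, dropout=0.1, use_bias=False):
        super().__init__()
        # One projection produces two halves: a GELU-activated gate and a linear value.
        self.dense_in = nn.Linear(hidden_size, 2 * intermed_size, bias=use_bias)
        self.dense_out = nn.Linear(intermed_size, hidden_size, bias=use_bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        gate, value = self.dense_in(x).chunk(2, dim=-1)
        return self.dropout(self.dense_out(nn.functional.gelu(gate) * value))
```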

#### Speeds, Sizes, Times 

- ~0.1674 s per step (~97,886 tokens/s)


## Evaluation


TBD