Tags: Fill-Mask · Transformers · PyTorch · Portuguese · deberta-v2 · albertina-pt* · albertina-100m-portuguese-ptpt · albertina-100m-portuguese-ptbr · albertina-900m-portuguese-ptpt · albertina-900m-portuguese-ptbr · albertina-1b5-portuguese-ptpt · albertina-1b5-portuguese-ptbr · bert · deberta · portuguese · encoder · foundation model · Inference Endpoints
Update README.md
README.md CHANGED
@@ -25,7 +25,7 @@ widget:
 ---
 ---
 <img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
-<p style="text-align: center;"> This is the model card for Albertina 1.5B PTPT
+<p style="text-align: center;"> This is the model card for Albertina 1.5B PTPT.
 You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
 </p>
 
@@ -34,13 +34,13 @@ widget:
 # Albertina 1.5B PTPT
 
 
-**Albertina 1.5B PTPT** is a foundation, large language model for **European
+**Albertina 1.5B PTPT** is a foundation, large language model for the **European variant of Portuguese**.
 
 It is an **encoder** of the BERT family, based on the neural architecture Transformer and
 developed over the DeBERTa model, with most competitive performance for this language.
 It has different versions that were trained for different variants of Portuguese (PT),
-namely the European variant
-and it is distributed free of charge and under a most permissible license.
+namely the European variant, spoken in Portugal (**PTPT**) and the American variant, spoken in Brazil (**PTBR**),
+and it is openly distributed free of charge under an open license.
 
 | Albertina's Family of Models |
 |----------------------------------------------------------------------------------------------------------|
@@ -53,10 +53,10 @@ and it is distributed free of charge and under a most permissible license.
 | [**Albertina 100M PTPT**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptpt-encoder) |
 | [**Albertina 100M PTBR**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptbr-encoder) |
 
-**Albertina 1.5B PTPT** is
+**Albertina 1.5B PTPT** is a version for the **European variant of Portuguese**,
 and to the best of our knowledge, this is an encoder specifically for this language and variant
-that, at the time of its initial distribution, sets a new state of the art for it,
-and distributed for reuse.
+that, at the time of its initial distribution, with its 1.5 billion parameters and performance scores sets a new state of the art for it,
+and is made publicly available and distributed for reuse.
 
 It is an **encoder** of the BERT family, based on the neural architecture Transformer and
 developed over the DeBERTa model, with most competitive performance for this language.
@@ -64,7 +64,7 @@ It is distributed free of charge and under a most permissible license.
 
 
 **Albertina 1.5B PTPT** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
-For
+For a fully detailed description, check the respective [publication](https://arxiv.org/abs/?):
 
 ``` latex
 @misc{albertina-pt-fostering,
@@ -109,26 +109,30 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER
 ## Preprocessing
 
 We filtered the PTPT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
-We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering
+We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering
+for language identification given the corpus was pre-selected as Portuguese.
 
 
 ## Training
 
 As codebase, we resorted to the [DeBERTa V2 xxlarge](https://huggingface.co/microsoft/deberta-v2-xxlarge), for English.
 
-To train **Albertina 1.5B PTPT**, the data set was tokenized with the original DeBERTa tokenizer with a 128-token sequence
-
+To train **Albertina 1.5B PTPT**, the data set was tokenized with the original DeBERTa tokenizer with a 128-token sequence
+truncation and dynamic padding for 250k steps,
+a 256-token sequence-truncation for 80k steps
+([**Albertina 1.5B PTPT 256**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptpt-encoder-256)) and finally a 512-token sequence-truncation for 60k steps.
 These steps correspond to the equivalent setup of 48 hours on a2-megagpu-16gb Google Cloud A2 node for the 128-token input sequences, 24 hours of computation for the 256-token
 input sequences and 24 hours of computation for the 512-token input sequences.
 We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 
 <br>
 
-#
+# Performance
 
 
-We resorted to [
-We automatically translated the tasks from GLUE and SUPERGLUE using [DeepL Translate](https://www.deepl.com/), which specifically
+We resorted to [extraGLUE](https://huggingface.co/datasets/PORTULAN/extraglue), a **PTPT version of the GLUE and SUPERGLUE** benchmark.
+We automatically translated the tasks from GLUE and SUPERGLUE using [DeepL Translate](https://www.deepl.com/), which specifically
+provides translation from English to PTPT or PTBR as possible options.
 
 | Model | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) | COPA (Accuracy) | CB (F1) | MultiRC (F1) | BoolQ (Accuracy) |
 |-------------------------------|----------------|----------------|-----------|-----------------|-----------------|------------|--------------|------------------|
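Since the card describes a fill-mask encoder, a minimal usage sketch may be useful alongside the updated text. The checkpoint ID below is an assumption inferred from the family naming in the card; the diff itself only names the 256-token variant (albertina-1b5-portuguese-ptpt-encoder-256).

```python
# Minimal fill-mask sketch. The checkpoint ID is an assumption inferred from
# the naming of the sibling models linked in the card, not confirmed by this diff.
from transformers import pipeline

unmasker = pipeline("fill-mask",
                    model="PORTULAN/albertina-1b5-portuguese-ptpt-encoder")
for pred in unmasker("A capital de Portugal é [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))
```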
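On the preprocessing side, the updated text states two deviations from the BLOOM pipeline defaults: stopword filtering is kept off (it would disrupt syntactic structure) and language-identification filtering is skipped (the corpus is already Portuguese). The snippet below is a hypothetical illustration of that configuration choice, not the actual API of the bigscience-workshop/data-preparation pipeline.

```python
# Hypothetical illustration of the stated filtering choices; the real BLOOM
# data-preparation pipeline is configured through its own scripts and configs.
FILTERS = {
    "stopword_filtering": False,  # skipped: would disrupt syntactic structure
    "language_id": False,         # skipped: corpus pre-selected as Portuguese
}

def keep_document(doc: str) -> bool:
    # Stand-in for the remaining default cleaning steps (dedup, length, etc.).
    return len(doc.split()) >= 5
```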
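The training recipe tokenizes with the original DeBERTa tokenizer and moves through 128-, 256- and 512-token truncation stages with dynamic padding. Below is a sketch of the first stage, assuming a standard masked-language-modelling collator; the collator choice and the 15% masking rate are assumptions, as the diff states only the tokenizer and truncation lengths.

```python
# Sketch of the 128-token stage: original DeBERTa tokenizer, truncation at 128,
# dynamic padding via a collator. The MLM collator and masking rate are assumptions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xxlarge")
enc = tok(["Um exemplo de frase em português europeu."],
          truncation=True, max_length=128)
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)
batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])
```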
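The stated optimisation schedule (learning rate 1e-5, linear decay, 10k warm-up steps) maps directly onto a standard linear schedule with warm-up. In the sketch below, the 390k total steps are the sum of the three stages quoted in the text (250k + 80k + 60k); AdamW and the stand-in parameters are assumptions.

```python
# Sketch of the stated schedule: lr 1e-5, linear decay, 10k warm-up steps.
# 390k total = 250k + 80k + 60k stage steps; the optimizer is an assumption.
import torch
from transformers import get_linear_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the encoder weights
optimizer = torch.optim.AdamW(params, lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=390_000)
```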