|
--- |
|
language: |
|
- sv |
|
- 'no' |
|
- da |
|
- en |
|
license: mit |
|
tags: |
|
- bert |
|
- roberta |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: Huvudstaden i Sverige är <mask>. |
|
example_title: Swedish |
|
- text: Hovedstaden i Norge er <mask>. |
|
example_title: Norwegian |
|
- text: Danmarks hovedstad er <mask>. |
|
example_title: Danish |
|
--- |
|
|
|
# roberta-large-1160k |
|
|
|
## Intended uses |
|
|
|
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. |
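As a rough sketch of what downstream fine-tuning could look like (the dataset files, label count and hyperparameters below are placeholders, not an official recipe):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "AI-Sweden-Models/roberta-large-1160k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels depends on your task; 3 is only a placeholder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Replace with your own labelled data; CSV files with "text" and "label" columns assumed.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-large-1160k-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```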
|
|
|
### How to use |
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k') |
|
>>> unmasker("Huvudstaden i Sverige är <mask>.") |
|
[{'score': 0.5841221213340759, |
|
'token': 1945, |
|
'token_str': ' Stockholm', |
|
'sequence': 'Huvudstaden i Sverige är Stockholm.'}, |
|
{'score': 0.06775698810815811, |
|
'token': 5007, |
|
'token_str': ' Göteborg', |
|
'sequence': 'Huvudstaden i Sverige är Göteborg.'}, |
|
{'score': 0.05057400465011597, |
|
'token': 5761, |
|
'token_str': ' Malmö', |
|
'sequence': 'Huvudstaden i Sverige är Malmö.'}, |
|
{'score': 0.021936343982815742, |
|
'token': 21449, |
|
'token_str': ' Norrköping', |
|
'sequence': 'Huvudstaden i Sverige är Norrköping.'}, |
|
{'score': 0.017798304557800293, |
|
'token': 5658, |
|
'token_str': ' Uppsala', |
|
'sequence': 'Huvudstaden i Sverige är Uppsala.'}] |
|
``` |
|
```python |
|
>>> unmasker("Hovedstaden i Norge er <mask>.") |
|
[{'score': 0.6792309284210205, |
|
'token': 5158, |
|
'token_str': ' Oslo', |
|
'sequence': 'Hovedstaden i Norge er Oslo.'}, |
|
{'score': 0.09379775077104568, |
|
'token': 15456, |
|
'token_str': ' Trondheim', |
|
'sequence': 'Hovedstaden i Norge er Trondheim.'}, |
|
{'score': 0.052535850554704666, |
|
'token': 11370, |
|
'token_str': ' Bergen', |
|
'sequence': 'Hovedstaden i Norge er Bergen.'}, |
|
{'score': 0.03465486690402031, |
|
'token': 29407, |
|
'token_str': ' hovedstaden', |
|
'sequence': 'Hovedstaden i Norge er hovedstaden.'}, |
|
{'score': 0.03017985075712204, |
|
'token': 33311, |
|
'token_str': ' Kristiansand', |
|
'sequence': 'Hovedstaden i Norge er Kristiansand.'}] |
|
``` |
|
```python |
|
>>> unmasker("Danmarks hovedstad er <mask>.") |
|
[{'score': 0.11624140292406082, |
|
'token': 4794, |
|
'token_str': ' København', |
|
'sequence': 'Danmarks hovedstad er København.'}, |
|
{'score': 0.045051511377096176, |
|
'token': 7680, |
|
'token_str': ' død', |
|
'sequence': 'Danmarks hovedstad er død.'}, |
|
{'score': 0.02936543896794319, |
|
'token': 10795, |
|
'token_str': ' lukket', |
|
'sequence': 'Danmarks hovedstad er lukket.'}, |
|
{'score': 0.026030730456113815, |
|
'token': 13580, |
|
'token_str': ' Odense', |
|
'sequence': 'Danmarks hovedstad er Odense.'}, |
|
{'score': 0.02130937948822975, |
|
'token': 16347, |
|
'token_str': ' Roskilde', |
|
'sequence': 'Danmarks hovedstad er Roskilde.'}] |
|
``` |
|
|
|
Here is how to use this model to get the features of a given text in PyTorch: |
|
|
|
```python
from transformers import RobertaTokenizer, RobertaModel

# Load the tokenizer and the encoder.
tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')

# Tokenize the input and run a forward pass to get the hidden states.
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
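The `output.last_hidden_state` tensor holds one 1024-dimensional vector per token; a common way to obtain a single sentence embedding is to mean-pool those vectors over the attention mask (a sketch, not an official recipe):

```python
# Mean-pool token embeddings, ignoring padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1)   # (batch, seq_len, 1)
token_embeddings = output.last_hidden_state            # (batch, seq_len, 1024)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024])
```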
|
|
|
## Training data |
|
The model was trained on the Scandinavian subset of the Nordic Pile (Swedish, Norwegian and Danish), consisting of 414 962 688 text samples.
|
|
|
## Training procedure |
|
|
|
The model was trained with the [optimum-habana](https://github.com/huggingface/optimum-habana) framework on 8x Intel® Gaudi® 2 AI accelerators, managed by Intel Sweden AB.
|
|
|
The weights from [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large) were used as initialization, while the tokenizer was trained from scratch.
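A rough sketch of how such an initialization can be set up (the corpus iterator and vocabulary size below are illustrative placeholders, not the actual training script):

```python
from transformers import AutoTokenizer, RobertaForMaskedLM

# Initialize from the English roberta-large weights.
model = RobertaForMaskedLM.from_pretrained("FacebookAI/roberta-large")

# Train a new BPE tokenizer on the Scandinavian corpus (tiny placeholder iterator).
base_tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-large")

def corpus_iterator():
    # In practice this would stream the Nordic Pile subset in batches of raw text.
    yield ["Huvudstaden i Sverige är Stockholm.", "Hovedstaden i Norge er Oslo."]

new_tokenizer = base_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=50265)

# Resize the embedding matrix to match the new vocabulary.
model.resize_token_embeddings(len(new_tokenizer))
```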
|
|
|
This model is an intermediate checkpoint at step 1 160 000 of 1 350 790, corresponding to epoch 4.29 of the planned 5 epochs.
|
|
|
A batch size of 1536 was used. |
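Continuing from the initialization sketch above (reusing `model` and `new_tokenizer`), a minimal outline of the MLM training setup with `optimum-habana` might look as follows; the Gaudi config name, placeholder dataset and per-device batch size are illustrative assumptions rather than the exact values of the actual run:

```python
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# Tiny placeholder dataset; the real run used the Scandinavian Nordic Pile subset.
texts = ["Huvudstaden i Sverige är Stockholm.", "Danmarks hovedstad er København."]
train_dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: new_tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking for masked language modelling.
data_collator = DataCollatorForLanguageModeling(tokenizer=new_tokenizer, mlm_probability=0.15)

# Illustrative: 8 Gaudi 2 devices x 192 examples per device = 1536 global batch size.
args = GaudiTrainingArguments(
    output_dir="roberta-large-nordic",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/roberta-large",
    per_device_train_batch_size=192,
    num_train_epochs=5,
)

trainer = GaudiTrainer(
    model=model,                 # initialized from roberta-large above
    args=args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```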
|
|
|
## Evaluation results |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
| rank | da_rank | no_rank | sv_rank | dansk | angry_tweets | scala_da | scandiqa_da | norne_nb | norne_nn | norec | scala_nb | scala_nn | norquad | suc3 | swerec | scala_sv | scandiqa_sv | |
|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |
|
| 1.3 | 1.33 | 1.34 | 1.23 | 74.16 | 51.2 | 73.87 | 49.34 | 92.01 | 87.17 | 60.11 | 72.85 | 65.56 | 60.38 | 82.65 | 77.25 | 77.9 | 49.64 | |
|
|
|
As of 2024/03/26, it ranks #2 on the [ScandEval](https://scandeval.com/swedish-nlu/) Swedish NLU leaderboard, behind *gpt-4-0613*.
|
|