---
language:
- sv
- 'no'
- da
- en
license: mit
tags:
- bert
- roberta
pipeline_tag: fill-mask
widget:
- text: Huvudstaden i Sverige är <mask>.
  example_title: Swedish
- text: Hovedstaden i Norge er <mask>.
  example_title: Norwegian
- text: Danmarks hovedstad er <mask>.
  example_title: Danish
---
# roberta-large-1160k
## Intended uses
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k')
>>> unmasker("Huvudstaden i Sverige är <mask>.")
[{'score': 0.5841221213340759,
'token': 1945,
'token_str': ' Stockholm',
'sequence': 'Huvudstaden i Sverige är Stockholm.'},
{'score': 0.06775698810815811,
'token': 5007,
'token_str': ' Göteborg',
'sequence': 'Huvudstaden i Sverige är Göteborg.'},
{'score': 0.05057400465011597,
'token': 5761,
'token_str': ' Malmö',
'sequence': 'Huvudstaden i Sverige är Malmö.'},
{'score': 0.021936343982815742,
'token': 21449,
'token_str': ' Norrköping',
'sequence': 'Huvudstaden i Sverige är Norrköping.'},
{'score': 0.017798304557800293,
'token': 5658,
'token_str': ' Uppsala',
'sequence': 'Huvudstaden i Sverige är Uppsala.'}]
```
```python
>>> unmasker("Hovedstaden i Norge er <mask>.")
[{'score': 0.6792309284210205,
'token': 5158,
'token_str': ' Oslo',
'sequence': 'Hovedstaden i Norge er Oslo.'},
{'score': 0.09379775077104568,
'token': 15456,
'token_str': ' Trondheim',
'sequence': 'Hovedstaden i Norge er Trondheim.'},
{'score': 0.052535850554704666,
'token': 11370,
'token_str': ' Bergen',
'sequence': 'Hovedstaden i Norge er Bergen.'},
{'score': 0.03465486690402031,
'token': 29407,
'token_str': ' hovedstaden',
'sequence': 'Hovedstaden i Norge er hovedstaden.'},
{'score': 0.03017985075712204,
'token': 33311,
'token_str': ' Kristiansand',
'sequence': 'Hovedstaden i Norge er Kristiansand.'}]
```
```python
>>> unmasker("Danmarks hovedstad er <mask>.")
[{'score': 0.11624140292406082,
'token': 4794,
'token_str': ' København',
'sequence': 'Danmarks hovedstad er København.'},
{'score': 0.045051511377096176,
'token': 7680,
'token_str': ' død',
'sequence': 'Danmarks hovedstad er død.'},
{'score': 0.02936543896794319,
'token': 10795,
'token_str': ' lukket',
'sequence': 'Danmarks hovedstad er lukket.'},
{'score': 0.026030730456113815,
'token': 13580,
'token_str': ' Odense',
'sequence': 'Danmarks hovedstad er Odense.'},
{'score': 0.02130937948822975,
'token': 16347,
'token_str': ' Roskilde',
'sequence': 'Danmarks hovedstad er Roskilde.'}]
```
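Each call returns a list of dictionaries with `score`, `token`, `token_str` and `sequence` keys. As a minimal sketch of how that output might be post-processed, the hypothetical helper below (`top_prediction` and its `min_score` parameter are illustrative names, not part of `transformers`) picks the best candidate and rejects it when the score is low, as could be useful for the Danish example, where the top score is only about 0.12.

```python
def top_prediction(predictions, min_score=0.0):
    """Return (token, score) for the highest-scoring fill, or None.

    `predictions` is assumed to be the list of dicts returned by the
    fill-mask pipeline for an input with a single <mask>.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return None
    # token_str carries a leading space in the pipeline output; strip it.
    return best["token_str"].strip(), best["score"]


# Two entries copied from the Danish example output.
danish = [
    {"score": 0.11624140292406082, "token": 4794, "token_str": " København",
     "sequence": "Danmarks hovedstad er København."},
    {"score": 0.045051511377096176, "token": 7680, "token_str": " død",
     "sequence": "Danmarks hovedstad er død."},
]

print(top_prediction(danish))                 # best candidate and its score
print(top_prediction(danish, min_score=0.5))  # rejected: score below threshold
```

The threshold is an arbitrary illustration; a suitable value depends on the application.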
Here is how to use this model to get the features of a given text in PyTorch:
```python
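from transformers import RobertaTokenizer, RobertaModel

# Load the tokenizer and the encoder (without the masked-LM head)
tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')

text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# output.last_hidden_state holds one feature vector per input token
output = model(**encoded_input)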
```
## Training data
The Scandinavian subset of the Nordic Pile (Swedish, Norwegian, Danish), consisting of 414 962 688 text samples.
## Training procedure
The model was trained with the [optimum-habana](https://github.com/huggingface/optimum-habana) framework on 8 Intel® Gaudi® 2 AI accelerators, managed by Intel Sweden AB.
The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, and the tokenizer was trained from scratch.
This model is an intermediate checkpoint, taken at step 1 160 000 of 1 350 790. The full run is 5 epochs; this checkpoint corresponds to epoch 4.29.
A batch size of 1536 was used.
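The step counts and the epoch figure follow from the numbers given in this card. The sketch below reproduces them, assuming each training step consumes exactly one batch of 1536 samples and that one epoch is a single pass over all 414 962 688 samples; the variable names are illustrative.

```python
samples = 414_962_688        # text samples in the training data
batch_size = 1536            # samples per training step (assumed)
epochs = 5                   # length of the full run
checkpoint_step = 1_160_000  # step at which this checkpoint was saved

steps_per_epoch = samples // batch_size
total_steps = steps_per_epoch * epochs
checkpoint_epoch = checkpoint_step / steps_per_epoch

print(steps_per_epoch, total_steps, round(checkpoint_epoch, 2))
# prints: 270158 1350790 4.29
```

Under these assumptions the total matches the 1 350 790 steps stated for the full run, and the sample count divides evenly by the batch size.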
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
| rank | da_rank | no_rank | sv_rank | dansk | angry_tweets | scala_da | scandiqa_da | norne_nb | norne_nn | norec | scala_nb | scala_nn | norquad | suc3 | swerec | scala_sv | scandiqa_sv |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 1.3 | 1.33 | 1.34 | 1.23 | 74.16 | 51.2 | 73.87 | 49.34 | 92.01 | 87.17 | 60.11 | 72.85 | 65.56 | 60.38 | 82.65 | 77.25 | 77.9 | 49.64 |
As of 2024/03/26, it is ranked #2 on [ScandEval](https://scandeval.com/swedish-nlu/), after *gpt-4-0613*.