---
language:
- sl
- en
- multilingual
tags:
- generated_from_trainer
license: cc-by-sa-4.0
---

# SloBERTa-SlEng

SloBERTa-SlEng is a masked language model based on [SloBERTa](https://huggingface.co./EMBEDDIA/sloberta), a monolingual Slovene model.
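
The model can be used for masked-token prediction out of the box. A minimal usage sketch, assuming the model is hosted under the hypothetical id `cjvt/sloberta-sleng`:

```python
from transformers import pipeline

# Hypothetical model id; adjust if the hosted repository name differs.
fill_mask = pipeline("fill-mask", model="cjvt/sloberta-sleng")

# Query the tokenizer for the mask token instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"Danes je res lep {mask}."))  # "Today is a really nice <mask>."
```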

SloBERTa-SlEng replaces the tokenizer, vocabulary, and embedding layer of the SloBERTa model.
The new tokenizer and vocabulary are bilingual (Slovene-English) and were built from the conversational, non-standard, and slang language the model was trained on;
they are the same as in the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model.
The new embedding weights were initialized from the SloBERTa embeddings.
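
The card does not spell out the initialization procedure; below is a minimal sketch of one plausible approach, copying the SloBERTa embedding row for every token shared between the two vocabularies and randomly initializing the rest. The token-matching strategy is our assumption, not the authors' published method.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
old_tok = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
new_tok = AutoTokenizer.from_pretrained("cjvt/sleng-bert")

old_emb = model.get_input_embeddings().weight.data.clone()
old_vocab = old_tok.get_vocab()

# Resize to the new vocabulary, randomly initialize every row, then copy
# the SloBERTa row for each token present in both vocabularies.
model.resize_token_embeddings(len(new_tok))
new_emb = model.get_input_embeddings().weight.data
new_emb.normal_(mean=0.0, std=model.config.initializer_range)
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]
```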

SloBERTa-SlEng is thus the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora,
the same corpora used to train the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model.
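
The continued pre-training step follows the standard masked-language-modeling recipe; a minimal `Trainer`-based sketch, assuming the corpora are collected in a hypothetical one-sentence-per-line file `conversational_sl_en.txt` (the actual hyperparameters are not documented in this card):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
# Start from SloBERTa with the embedding layer swapped as sketched above.
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
model.resize_token_embeddings(len(tokenizer))

# "conversational_sl_en.txt" is a hypothetical plain-text dump of the corpora.
dataset = load_dataset("text", data_files={"train": "conversational_sl_en.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="sloberta-sleng", num_train_epochs=2)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```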

## Training corpora

The model was trained on English and Slovene tweets, the Slovene [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201) corpora,
and a small subset of the English [Oscar](https://huggingface.co./datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as close to equal as possible.
The training corpora contained about 2.7 billion words in total.


### Framework versions

- Transformers 4.22.0.dev0
- Pytorch 1.13.0a0+d321be6
- Datasets 2.4.0
- Tokenizers 0.12.1