---
language:
- sl
- en
- multilingual
tags:
- generated_from_trainer
license: cc-by-sa-4.0
---
# SloBERTa-SlEng
SloBERTa-SlEng is a masked language model based on the [SloBERTa](https://huggingface.co./EMBEDDIA/sloberta) Slovene model.
SloBERTa-SlEng replaces the tokenizer, vocabulary, and embedding layer of the SloBERTa model.
The new tokenizer and vocabulary are bilingual (Slovene-English) and were built from the conversational, non-standard, and slang language the model was trained on;
they are the same as in the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model.
The new embedding weights were initialized from the SloBERTa embeddings.
The resulting model was then further pre-trained for two epochs on the same conversational English and Slovene corpora as the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model.
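Since this is a masked language model, it can be loaded with the standard `transformers` fill-mask pipeline. A minimal sketch, assuming the model is published under the id `cjvt/sloberta-sleng` (inferred from the repository name and the related `cjvt/sleng-bert` model; adjust the id if it differs):

```python
from transformers import pipeline

# Assumed model id; replace with the actual Hub id if different.
fill_mask = pipeline("fill-mask", model="cjvt/sloberta-sleng")

# SloBERTa-style models use the <mask> token for masked positions.
predictions = fill_mask("Ljubljana je glavno mesto <mask>.")

for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction is a dictionary with the filled token (`token_str`), its probability (`score`), and the completed sequence.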
## Training corpora
The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201),
and a small subset of the English [Oscar](https://huggingface.co./datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible.
The training corpora contained about 2.7 billion words in total.
### Framework versions
- Transformers 4.22.0.dev0
- Pytorch 1.13.0a0+d321be6
- Datasets 2.4.0
- Tokenizers 0.12.1