---
language:
- sl
- en
- multilingual
tags:
- generated_from_trainer
license: cc-by-sa-4.0
---

# SloBERTa-SlEng

SloBERTa-SlEng is a masked language model based on the Slovene [SloBERTa](https://huggingface.co./EMBEDDIA/sloberta) model. SloBERTa-SlEng replaces the tokenizer, vocabulary, and embedding layer of SloBERTa. The tokenizer and vocabulary are bilingual, Slovene-English, and are built from the conversational, non-standard, and slang language the model was trained on; they are the same as in the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model. The new embedding weights were initialized from the SloBERTa embeddings. The resulting model was then further pre-trained for two epochs on the same conversational English and Slovene corpora as the [SlEng-bert](https://huggingface.co./cjvt/sleng-bert) model.

## Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201), and a small subset of the English [Oscar](https://huggingface.co./datasets/oscar) corpus. We kept the sizes of the English and Slovene corpora as close to equal as possible. The training corpora contained about 2.7 billion words in total.

### Framework versions

- Transformers 4.22.0.dev0
- Pytorch 1.13.0a0+d321be6
- Datasets 2.4.0
- Tokenizers 0.12.1
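
## Usage example

Since this is a masked language model, the most direct way to query it is through the `fill-mask` pipeline. The sketch below assumes the model is published on the Hugging Face Hub under the ID `cjvt/sloberta-sleng` (an assumption; substitute the actual model ID if it differs):

```python
from transformers import pipeline

# Assumption: the model is hosted on the Hugging Face Hub as
# "cjvt/sloberta-sleng"; adjust the ID if the model lives elsewhere.
fill_mask = pipeline("fill-mask", model="cjvt/sloberta-sleng")

# Use the tokenizer's own mask token rather than hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Slovene example: "This is a very <mask> model."
for prediction in fill_mask(f"To je zelo {mask} model."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```

Because the tokenizer and vocabulary are bilingual, the same pipeline can be queried with English or Slovene input, including non-standard and conversational language.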