---
language:
- tr
tags:
- roberta
license: cc-by-nc-sa-4.0
---

# RoBERTweetTurkCovid (uncased)

Pretrained model on the Turkish language using a masked language modeling (MLM) objective. The model is uncased.

The pretraining corpus is a collection of Turkish tweets related to COVID-19.

The model architecture is similar to RoBERTa-base (12 layers, 12 attention heads, and a hidden size of 768). The tokenization algorithm is WordPiece, with a vocabulary size of 30k.
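
For reference, the configuration below is a minimal sketch of the architecture described above; the four values shown come from this card, while every other field is left at the `transformers` library defaults (an assumption, since the card does not state them):

```python
from transformers import RobertaConfig

# Architecture values stated in this card; all unlisted fields fall back to
# the library defaults, which this card does not confirm.
config = RobertaConfig(
    vocab_size=30000,        # 30k WordPiece vocabulary
    hidden_size=768,         # hidden size 768
    num_hidden_layers=12,    # 12 layers
    num_attention_heads=12,  # 12 attention heads
)
print(config)
```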

The details of pretraining can be found in the paper cited in the BibTeX entry at the end of this card.

The following code can be used for model loading and tokenization; the example max length (768) can be changed:

```python
from transformers import AutoModel, PreTrainedTokenizerFast

# Load the pretrained model ([model_path] is a placeholder for the model directory).
model = AutoModel.from_pretrained([model_path])
# For sequence classification:
# model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])

# Load the tokenizer from its file and register the special tokens.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 768
```
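
As a quick usage sketch building on the snippet above: the model was pretrained with an MLM objective, so it can rank candidate tokens for a masked position. The sketch assumes the tokenizer file defines `[MASK]` as a vocabulary token and swaps in `AutoModelForMaskedLM` (the variant with the language-modeling head); the Turkish input sentence is purely illustrative:

```python
import torch
from transformers import AutoModelForMaskedLM

# Same placeholder path as above, but loaded with the masked-LM head.
mlm_model = AutoModelForMaskedLM.from_pretrained([model_path])

# Illustrative uncased Turkish tweet with one masked token
# ("there is [MASK] news about covid vaccines").
text = "covid aşıları hakkında [MASK] haberler var"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = mlm_model(**inputs).logits

# Locate the [MASK] position and print the five highest-scoring tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```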

### BibTeX entry and citation info

```bibtex
@InProceedings{clef-checkthat:2022:task1:oguzhan,
  author = {Cagri Toraman and Oguzhan Ozcelik and Furkan Şahinuç and Umitcan Sahin},
  title = "{ARC-NLP at CheckThat! 2022:} Contradiction for Harmful Tweet Detection",
  year = {2022},
  booktitle = "Working Notes of {CLEF} 2022 - Conference and Labs of the Evaluation Forum",
  editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
  series = {CLEF~'2022},
  address = {Bologna, Italy},
}
```