---
language: tr
license: mit
---

# 🇹🇷 RoBERTaTurk-Small-Clean

## Model description

This is a small Turkish RoBERTa model, trained on a clean dataset free of typos to understand the Turkish language better.

The training data was drawn from Turkish Wikipedia, the Turkish portion of the OSCAR corpus, and news websites. The raw corpus was 38 GB; after removing every sentence containing a spelling mistake, 20 GB of good-quality text remained for pretraining. This cleaning step helps the model work especially well on Turkish texts that are free of errors.
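The exact cleaning procedure is described in the paper cited below; purely as an illustration, the corpus filter can be thought of as a sentence-level pass, where `has_spelling_error` here is a hypothetical stand-in for the authors' error detector:

```python
# Hypothetical sketch of the corpus-cleaning idea (not the authors' actual code).
def has_spelling_error(sentence: str, vocabulary: set) -> bool:
    """Stand-in detector: flag a sentence if any token is out of vocabulary."""
    return any(word.lower() not in vocabulary for word in sentence.split())

def clean_corpus(sentences, vocabulary):
    # Keep only sentences in which no spelling error was detected.
    return [s for s in sentences if not has_spelling_error(s, vocabulary)]
```
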
The model is a bit smaller than the standard RoBERTa base model: it has 8 layers instead of 12, which makes it faster and lighter to run while still being very good at understanding Turkish.
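If you want to check the architecture yourself, the layer count is exposed on the standard Hugging Face config object (this assumes the checkpoint ships a regular RoBERTa configuration):

```python
from transformers import AutoConfig

# Load only the configuration, without downloading the model weights
config = AutoConfig.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
print(config.num_hidden_layers)  # expected: 8
```
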
It is built to perform especially well on Turkish text that is written correctly, without spelling errors.

Thanks to Turkcell, we were able to train the model for 1.5M steps on an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz with 256 GB RAM and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
# Usage

Load the model and tokenizer with the Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
```
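As a quick sanity check, you can run a masked prediction by hand; this is a minimal sketch of what the fill-mask pipeline in the next section does internally, assuming the PyTorch backend:

```python
import torch

# Predict the masked token in a sample sentence
text = "iki ülke arasında <mask> başladı"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take its top-5 candidate tokens
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
top_ids = logits[0, mask_pos].topk(5).indices
print([tokenizer.decode(i).strip() for i in top_ids])
```
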
# Fill Mask Usage

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-small-turkish-clean-uncased",
    tokenizer="burakaytan/roberta-small-turkish-clean-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

# Output:
# [{'sequence': 'iki ülke arasında savaş başladı',
#   'score': 0.14830906689167023,
#   'token': 1745,
#   'token_str': ' savaş'},
#  {'sequence': 'iki ülke arasında çatışmalar başladı',
#   'score': 0.1442396193742752,
#   'token': 18223,
#   'token_str': ' çatışmalar'},
#  {'sequence': 'iki ülke arasında gerginlik başladı',
#   'score': 0.12025047093629837,
#   'token': 13638,
#   'token_str': ' gerginlik'},
#  {'sequence': 'iki ülke arasında çatışma başladı',
#   'score': 0.0615813322365284,
#   'token': 5452,
#   'token_str': ' çatışma'},
#  {'sequence': 'iki ülke arasında görüşmeler başladı',
#   'score': 0.04512731358408928,
#   'token': 4736,
#   'token_str': ' görüşmeler'}]
```
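Each result is a plain dict with `sequence`, `score`, `token`, and `token_str` keys, so the top prediction can be pulled out directly:

```python
results = fill_mask("iki ülke arasında <mask> başladı")
best = results[0]
print(best["token_str"].strip(), best["score"])  # e.g. "savaş" and its score
```
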
## Citation and Related Information

To cite this model:

```bibtex
@article{aytan2023deep,
  title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
  author={Aytan, Burak and {\c{S}}akar, Cemal Okan},
  journal={Turkish Journal of Electrical Engineering and Computer Sciences},
  volume={31},
  number={3},
  pages={581--595},
  year={2023}
}
```