---
language: tr
license: mit
---
# 🇹🇷 RoBERTaTurk-Small-Clean

## Model description
This is a small, uncased RoBERTa model for Turkish, pretrained on a cleaned corpus free of typos.

The training data comes from Turkish Wikipedia, Turkish OSCAR, and news websites.
The raw corpus was 38 GB. After every sentence containing mistakes was removed, 20 GB of high-quality text remained, and the model was trained on that.
As a result, the model works best on Turkish text that does not contain errors.
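
The card does not say how sentences with mistakes were detected. Purely as an illustration, the sketch below keeps a sentence only when every word appears in a known-word list. `KNOWN_WORDS`, `is_clean`, and `filter_corpus` are hypothetical names, and a real pipeline would presumably rely on a proper Turkish spell checker or morphological analyzer rather than a fixed set.

```python
import re

# Hypothetical toy lexicon; a real setup would need a full Turkish
# dictionary or a morphological analyzer.
KNOWN_WORDS = {"iki", "ülke", "arasında", "savaş", "başladı", "bugün", "hava", "güzel"}

WORD_RE = re.compile(r"\w+")  # \w is Unicode-aware for str patterns


def is_clean(sentence: str) -> bool:
    """Return True when every word of the sentence is in KNOWN_WORDS."""
    words = WORD_RE.findall(sentence)
    return bool(words) and all(word in KNOWN_WORDS for word in words)


def filter_corpus(sentences):
    """Keep only the sentences that pass the is_clean check."""
    return [s for s in sentences if is_clean(s)]


corpus = [
    "iki ülke arasında savaş başladı",
    "bugün hava güzle",  # 'güzle' is a typo for 'güzel'
    "bugün hava güzel",
]
clean = filter_corpus(corpus)
print(clean)
```

Dropping whole sentences rather than correcting them shrinks the corpus, which is consistent with the reduction from 38 GB to 20 GB described above.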

The model is smaller than the standard RoBERTa model: it has 8 layers instead of 12, which makes it faster and lighter to run while remaining strong at understanding Turkish.
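
To get a feel for what going from 12 to 8 layers saves, the estimate below counts the weights in one Transformer encoder layer and scales by depth. It assumes a hidden size of 768 and a feed-forward size of 3072, the values used by RoBERTa-base. The card does not state this model's dimensions, so treat the numbers as illustrative. Embeddings and the language-model head are left out.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Count weights and biases in one standard Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # scale and bias for two layer norms
    return attention + feed_forward + layer_norms


HIDDEN, FFN = 768, 3072  # assumed RoBERTa-base sizes, not taken from the card

per_layer = encoder_layer_params(HIDDEN, FFN)
small = 8 * per_layer
base = 12 * per_layer

print(f"per layer: {per_layer:,}")
print(f"8 layers:  {small:,}")
print(f"12 layers: {base:,}")
print(f"ratio:     {small / base:.2f}")
```

Under these assumptions the encoder stack of the 8-layer model holds two thirds of the weights of a 12-layer stack.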

It is intended for understanding Turkish text, particularly text that is written correctly.
Thanks to Turkcell, the model was trained for 1.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB of RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

## Usage
Load the tokenizer and model with the transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
```
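
Because the checkpoint is uncased, it likely expects lowercase input, as in the example sentence used in this card. Whether the tokenizer lowercases text on its own is not stated here, so the helper below is an optional precaution, and `turkish_lower` is a hypothetical name. Python's `str.lower()` follows language-neutral rules and maps `I` to `i`, whereas Turkish pairs `I` with `ı` and `İ` with `i`.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i pairs."""
    # Map the two Turkish-specific capitals first, then apply the generic rule.
    # \u0130 is capital dotted I, \u0131 is lowercase dotless i.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(turkish_lower("\u0130K\u0130 \u00dcLKE ARASINDA"))
```

The call prints the sentence prefix from the fill-mask example in lowercase, with the dotless `ı` preserved in the last word.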

## Fill Mask Usage

```python
from transformers import pipeline

# Build a fill-mask pipeline backed by the model and its tokenizer.
fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-small-turkish-clean-uncased",
    tokenizer="burakaytan/roberta-small-turkish-clean-uncased",
)

# Returns the highest-scoring candidates for the <mask> position, best first.
fill_mask("iki ülke arasında <mask> başladı")

# Output:
# [{'sequence': 'iki ülke arasında savaş başladı',
#   'score': 0.14830906689167023,
#   'token': 1745,
#   'token_str': ' savaş'},
#  {'sequence': 'iki ülke arasında çatışmalar başladı',
#   'score': 0.1442396193742752,
#   'token': 18223,
#   'token_str': ' çatışmalar'},
#  {'sequence': 'iki ülke arasında gerginlik başladı',
#   'score': 0.12025047093629837,
#   'token': 13638,
#   'token_str': ' gerginlik'},
#  {'sequence': 'iki ülke arasında çatışma başladı',
#   'score': 0.0615813322365284,
#   'token': 5452,
#   'token_str': ' çatışma'},
#  {'sequence': 'iki ülke arasında görüşmeler başladı',
#   'score': 0.04512731358408928,
#   'token': 4736,
#   'token_str': ' görüşmeler'}]
```
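
Each prediction is a dictionary with `sequence`, `score`, `token`, and `token_str` keys, and `token_str` carries a leading space. A small helper can strip that space and drop low-scoring candidates. The sketch below works on a copy of the first three results from the example output; `top_words` and its `min_score` threshold are illustrative choices, not part of the library.

```python
# First three predictions copied from the example output.
results = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.14830906689167023, "token": 1745, "token_str": " savaş"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.1442396193742752, "token": 18223, "token_str": " çatışmalar"},
    {"sequence": "iki ülke arasında gerginlik başladı",
     "score": 0.12025047093629837, "token": 13638, "token_str": " gerginlik"},
]


def top_words(predictions, min_score=0.0):
    """Return (word, score) pairs at or above min_score, best first."""
    kept = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for word, score in top_words(results, min_score=0.13):
    print(f"{word}: {score:.3f}")
```

In practice, `results` would be the list returned by the `fill_mask(...)` call rather than a hard-coded copy.
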
## Citation and Related Information

To cite this model:
```bibtex
@article{aytan2023deep,
  title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
  author={Aytan, Burak and {\c{S}}akar, Cemal Okan},
  journal={Turkish Journal of Electrical Engineering and Computer Sciences},
  volume={31},
  number={3},
  pages={581--595},
  year={2023}
}
```