---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: Bu yazıyı bir bilgisayar yazdı. Yazarken
  context: ''
- text: İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda
  context: ''
---
# Turkish GPT2 Model (Fine-tuned)
# Türkçe GPT2 Modeli

## Model description
This is a GPT2-small English-based model, fine-tuned and further trained on Turkish Wikipedia articles as of 28-10-2020.

Live demo based on this work: https://www.metayazar.com/

Writer model fine-tuned on top of this one: https://huggingface.co./gorkemgoknar/gpt2-turkish-writer

This work is based on Pierre Guillou's tutorial:
https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

The code was adapted to work with fastai 2.x, and training was done on Google Colab. An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 33%, perplexity: 51.88 (see the evaluation results below).
Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co./gorkemgoknar/gpt2-small-turkish)
* [gpt2-small-turkish-writer](https://huggingface.co./gorkemgoknar/gpt2-turkish-writer)
## Intended uses & limitations

#### How to use

#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")
# Note: AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM is the drop-in replacement for a causal LM like GPT2.

# Use GPT2's full context window of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```
#### Generate 1 word

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output (passing labels makes the model also return the LM loss)
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]

# greedy choice: the most likely token at the last position
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```
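This predicts a single next token greedily (argmax over the logits at the last position); for longer and more varied continuations, use the sampling-based generation below.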
#### Generate Full Sequence

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,  # GPT2's eos token id, reused for padding
                                do_sample=True,
                                max_length=50,  # put the token number you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
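For quick experiments, the same checkpoint can also be driven through the high-level `pipeline` API. This is a minimal sketch (not part of the original examples); the sampling parameters simply mirror the ones above:

```python
from transformers import pipeline

# text-generation pipeline wrapping the same model and tokenizer
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
outputs = generator("Bu yazıyı bilgisayar yazdı.",
                    max_length=50, do_sample=True, top_k=40, num_return_sequences=1)
print(outputs[0]["generated_text"])
```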
#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
## Training data

Wikipedia Turkish article dump as of 28-10-2020.
## Training procedure
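The exact training script is not included in this card. As a rough illustration only, here is a minimal sketch of a fastai 2.x fine-tuning setup in the shape of Pierre Guillou's tutorial linked above; the transform and callback classes, data split, batch size, and learning rate are assumptions for illustration, not the actual values used.

```python
from fastai.text.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# English GPT2-small base, as described above
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

class TransformersTokenizer(Transform):
    # wrap the HuggingFace tokenizer as a fastai Transform
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): return tensor(self.tokenizer.encode(x))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

class DropOutput(Callback):
    # keep only the logits from the HuggingFace model output
    def after_pred(self): self.learn.pred = self.pred[0]

# `texts` is assumed to be a list of Turkish Wikipedia articles (loading omitted)
cut = int(len(texts) * 0.8)
splits = [list(range(cut)), list(range(cut, len(texts)))]
tls = TfmdLists(texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
dls = tls.dataloaders(bs=8, seq_len=1024)

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                cbs=[DropOutput], metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(5, 1e-4)  # 5 epochs, matching the table below
```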
## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.777015   | 4.621834   | 0.292547 | 101.680367 | 2:42:05 |
| 1     | 4.509412   | 4.403999   | 0.305574 | 81.777267  | 1:09:38 |
| 2     | 4.169529   | 4.120755   | 0.324908 | 61.605747  | 1:07:45 |
| 3     | 4.293973   | 4.177899   | 0.317211 | 65.228653  | 1:07:02 |
| 4     | 4.049848   | 3.949103   | 0.338347 | 51.888783  | 1:05:53 |

Epoch 0 was trained on a Tesla T4; the remaining epochs on a V100.
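As a sanity check (our note, not from the original card), the perplexity column is simply the exponential of the validation loss:

```python
import math

print(math.exp(3.949103))  # ≈ 51.8888, matching the epoch-4 perplexity above
```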