---
language:
- tr
thumbnail: null
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: Bu yazıyı bir bilgisayar yazdı. Yazarken
  context: ''
- text: İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda
  context: ''
---
# Turkish GPT2 Model Finetuned (Türkçe GPT2 Modeli)
## Model description
This is a GPT2-Small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.

A live demo based on this work is available at: https://www.metayazar.com/

A writer model fine-tuned on top of this model: https://huggingface.co./gorkemgoknar/gpt2-turkish-writer

Work is based on Pierre Guillou's tutorial, as described on this page: https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

The code was converted to work with fastai 2.x, and training was done on Google Colab. An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 33%, perplexity: 51.88.
Models are available:

- [gpt2-small-tuned-tr](https://huggingface.co./gorkemgoknar/gpt2-small-turkish)
- [gpt2-small-turkish-writer](https://huggingface.co./gorkemgoknar/gpt2-turkish-writer)
## Intended uses & limitations

### How to use

#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Set the maximum sequence length to 1024
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```
#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```
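The forward pass above also returns a loss for the input, which can be exponentiated into a rough perplexity value for that text. This small addition is not part of the original card:

```python
# Rough perplexity of the input text, derived from the loss computed above
print('perplexity:', torch.exp(loss).item())
```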
#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # put the number of tokens you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
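Alternatively, the same generation can be done with the Transformers text-generation pipeline; this is a minimal sketch not included in the original card:

```python
from transformers import pipeline

# Text-generation pipeline wrapping the same model checkpoint
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
result = generator("Bu yazıyı bilgisayar yazdı.", max_length=50, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```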
### Limitations and bias
The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
## Training data

Turkish Wikipedia article dump as of 28-10-2020.
## Training procedure

### Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 4.777015 | 4.621834 | 0.292547 | 101.680367 | 2:42:05 |
| 1 | 4.509412 | 4.403999 | 0.305574 | 81.777267 | 1:09:38 |
| 2 | 4.169529 | 4.120755 | 0.324908 | 61.605747 | 1:07:45 |
| 3 | 4.293973 | 4.177899 | 0.317211 | 65.228653 | 1:07:02 |
| 4 | 4.049848 | 3.949103 | 0.338347 | 51.888783 | 1:05:53 |
*Epoch 0 was trained on a Tesla T4; the remaining epochs on a V100.*
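The perplexity column appears to be exp(valid_loss); for example, for the final epoch:

```python
import math

# exp of the epoch-4 validation loss reproduces the reported perplexity
print(math.exp(3.949103))  # ≈ 51.89
```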