|
--- |
|
language: vi |
|
tags: |
|
- vi |
|
- vietnamese |
|
- gpt2 |
|
- text-generation |
|
- lm |
|
- nlp |
|
datasets: |
|
- oscar |
|
widget: |
|
- text: "Việt Nam là quốc gia có" |
|
--- |
|
|
|
# GPT-2 Vietnamese
|
|
|
A GPT-2 model pretrained on Vietnamese text with a causal language modeling (CLM) objective. The GPT-2 architecture was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
|
|
|
# How to use the model |
|
|
|
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('NlpHUST/gpt2-vietnamese')
model = GPT2LMHeadModel.from_pretrained('NlpHUST/gpt2-vietnamese')
model.eval()

text = "Việt Nam là quốc gia có"
input_ids = tokenizer.encode(text, return_tensors='pt')
max_length = 100

# Beam-search sampling: sample from the top 40 tokens on each of 5 beams,
# forbid repeated bigrams, and return 3 candidate continuations.
sample_outputs = model.generate(
    input_ids,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    max_length=max_length,
    min_length=max_length,
    top_k=40,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2,
    num_return_sequences=3,
)

for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))
    print('\n---')
```
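
Equivalently, the generation loop above can be replaced with the high-level `pipeline` API. A minimal sketch (the generation parameters here are illustrative, not tuned values from this model card):

```python
from transformers import pipeline

# The text-generation pipeline bundles tokenization, generation, and decoding.
generator = pipeline('text-generation', model='NlpHUST/gpt2-vietnamese')

outputs = generator("Việt Nam là quốc gia có", max_length=60, do_sample=True, top_k=40)
print(outputs[0]['generated_text'])
```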
|
|
|
# Model architecture |
|
A 12-layer, 768-hidden-size transformer-based language model. |
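
These hyperparameters can be checked against the hosted configuration; a minimal sketch:

```python
from transformers import GPT2Config

# Load only the configuration (no weights) and inspect the architecture.
config = GPT2Config.from_pretrained('NlpHUST/gpt2-vietnamese')
print(config.n_layer)  # expected: 12 transformer layers
print(config.n_embd)   # expected: 768 hidden size
```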
|
|
|
# Training |
|
The model was trained on the Vietnamese portion of the OSCAR dataset (about 32 GB of text) with the standard causal language modelling objective, on a v3-8 TPU for around 6 days. It reaches a perplexity of about 13.4 on a validation set held out from OSCAR.
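
For reference, perplexity on a piece of text can be estimated from the model's cross-entropy loss. The following is a minimal sketch, not the exact evaluation script behind the 13.4 figure, and the sample sentence is illustrative:

```python
import math
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('NlpHUST/gpt2-vietnamese')
model = GPT2LMHeadModel.from_pretrained('NlpHUST/gpt2-vietnamese')
model.eval()

text = "Việt Nam là quốc gia có đường bờ biển dài ở Đông Nam Á."
input_ids = tokenizer.encode(text, return_tensors='pt')

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy
    # over the shifted next-token predictions; perplexity is exp(loss).
    loss = model(input_ids, labels=input_ids).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```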
|
|
|
### GPT-2 Finetuning |
|
|
|
The following example fine-tunes the model on WikiText-2. We use the raw WikiText-2 (no tokens were replaced before
tokenization). The loss is that of causal language modeling.
|
|
|
The fine-tuning script is available [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py).
|
|
|
```bash |
|
python run_clm.py \ |
|
--model_name_or_path NlpHUST/gpt2-vietnamese \ |
|
--dataset_name wikitext \ |
|
--dataset_config_name wikitext-2-raw-v1 \ |
|
--per_device_train_batch_size 8 \ |
|
--per_device_eval_batch_size 8 \ |
|
--do_train \ |
|
--do_eval \ |
|
--output_dir /tmp/test-clm |
|
``` |
|
|