---
license: apache-2.0
language:
- my
tags:
- burmese
- transformer
- text-generation-inference
- nlp
---

# Burmese RoBERTa Text Generation

## Description
This model is fine-tuned from the earlier [BurmeseRoBERTa](https://huggingface.co/saihtaungkham/BurmeseRoBERTa) model, using Causal Language Modeling (CLM) on the following datasets:

1. `oscar-corpus/OSCAR-2301`
2. `5w4n/OSCAR-2019-Burmese-fix`
3. Wikipedia
4. [myParaphrase](https://github.com/ye-kyaw-thu/myParaphrase)
5. [myanmar_news](https://huggingface.co/datasets/myanmar_news)
6. [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200)
7. [myPOS](https://github.com/ye-kyaw-thu/myPOS.git)
8. [BurmeseProverbDataSet](https://github.com/vincent-paing/BurmeseProverbDataSet.git)
9. [TALPCo](https://github.com/matbahasa/TALPCo.git)

## Model Usage

```python
from transformers import pipeline

MODEL_NAME = "saihtaungkham/BurmeseRoBERTaCLM"
generator = pipeline("text-generation", model=MODEL_NAME, tokenizer=MODEL_NAME)

prompt = "မြန်မာနိုင်ငံနှင့် ထိုင်းနိုင်ငံ"
print(generator(prompt))

# Output
# [{'generated_text': 'မြန်မာနိုင်ငံနှင့် ထိုင်းနိုင်ငံ နှစ်နိုင်ငံ ပူးပေါင်း ဆောင်ရွက် မှု ကို ဆောင်ရွက် ရန် သဘောတူ '}]
```
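The sampling parameters `top_k` and `top_p` used in the next section restrict which tokens can be drawn at each generation step. As a rough illustration, the toy sketch below applies top-k and then top-p (nucleus) filtering to a hand-written probability table. The function name `top_k_top_p_filter` and the toy distribution are hypothetical, and the code is independent of the model and is not the `transformers` implementation.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Toy top-k then top-p filtering over a {token: probability} dict.

    Hypothetical helper for illustration only; not the transformers code.
    """
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Renormalize the survivors so they form a distribution again.
    total_k = sum(p for _, p in ranked)
    ranked = [(token, p / total_k) for token, p in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize once more so the final probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up distribution over four placeholder tokens.
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
print(filtered)
```

In this toy case only the two most likely tokens survive. A lower `top_p` or `top_k` narrows the candidate set and makes output more predictable, while higher values allow more varied text.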

## Adjust the model output
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "saihtaungkham/BurmeseRoBERTaCLM"

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

def generate_story(prompt,
                   model,
                   tokenizer,
                   # Keep step_token at or below 512; larger values raise an index out-of-bounds error.
                   # Values between 100 and 200 work best.
                   step_token=100,  # Maximum number of new tokens generated per step.
                   generate_steps=10,  # Number of generation rounds, each continuing from the previous output.
                   do_sample=True,  # Enable sampling so that top_k and top_p take effect.
                   top_k=50,  # Sample only from the k highest-scoring tokens.
                   top_p=0.95,  # Controls creativity; values above 0.8 are recommended for better output.
                   num_last_sentence_windows=1,  # Number of trailing sentences fed back as context; 1 or 2 works best.
                   print_internal=True  # Whether to print intermediate output while the model is running.
                   ):
    outputs = ""
    def generate_tokens(prompt):
        inputs = tokenizer(prompt,
                           max_length=512,
                           truncation=True,
                           return_tensors="pt").input_ids
        inference_results = model.generate(
            inputs,
            max_new_tokens=step_token,
            do_sample=do_sample,
            top_k=top_k,
            top_p=top_p)
        return tokenizer.batch_decode(
            inference_results,
            skip_special_tokens=True
            )[0]
    outputs += generate_tokens(prompt)
    if print_internal:
        print(outputs)
    for _ in range(generate_steps - 1):
        content = outputs.split("။")
        if len(content) > num_last_sentence_windows:
            content = content[-num_last_sentence_windows:]
            content = "။".join(content)
            inter_output = generate_tokens(content)
            inter_output = inter_output.split("။")
            fresh_content = inter_output[num_last_sentence_windows:]
            if print_internal:
                print("။".join(fresh_content))
            outputs += "။".join(fresh_content)
        else:
            inter_output = generate_tokens(outputs.strip())
            if print_internal:
                print(inter_output)
            outputs = inter_output

    return outputs

prompt = "ရန်ကုန်မြို့နေပြည်သူများ"

output = generate_story(
    model=model,
    prompt=prompt,
    tokenizer=tokenizer,
    step_token=100,
    generate_steps=5,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_last_sentence_windows=1,
    print_internal=True
)
print(output)

```
```text
ရန်ကုန်မြို့နေပြည်သူများ ပိုမိုလုံခြုံမှုရှိစေဖို့ အစိုးရမဟုတ်တဲ့ အဖွဲ့အစည်းတစ်ခုအနေနဲ့ ပြည်သူကို ထိန်းကျောင်းဖို့ အရေးကြီးတယ်လို့ ဆိုပါတယ်။ ပြည်သူတွေ ဒီအခြေအနေတွေကို
သိရှိအောင် ဘယ်လိုလုပ်ဆောင်ရမလဲ ဆိုတာတွေကိုလည်း ဗွီအိုအေမြန်မာပိုင်းက မဆုမွန် ဆက်သွယ်မေးမြန်းထားပါတယ်။မေး။ ။ ဒီနေ့ ရန်ကုန်မှာ ဖြစ်နေတဲ့ ပဋိပက္ခကြီးကတော့
အကြမ်းဖက်တိုက်ဖျက်ရေး စစ်တပ်က အာဏာသိမ်းတဲ့ လုပ်ရပ်ပေါ့နော်။ စစ်ကောင်စီဘက်ကလည်း ပြည်သူတွေကို စနစ်တကျ ထိန်းကျောင်းဖို့ တာဝန်ရှိတယ်။ ပြည်သူက
ဘယ်လောက်လေးစားတယ်၊ ဘယ်လိုခံစားရတယ်၊ ဘာတွေ အလေးထားတယ်၊ ဘယ်လောက်အထိ လိုက်နာတယ်၊ ဘာကြောင့်” ဒီအကြောင်းနဲ့ ပတ်သက်ပြီး မြန်မာနိုင်ငံဆိုင်ရာ
ကုလသံ၀တမန် လက်ထောက် အမြဲတမ်းအတွင်း၀န် မစ္စ ဗီယန်ကျန်းက “ကျနော်တို့ဟာ လူ့အခွင့်အရေးချိုးဖောက်မှုတွေ ကျူးလွန်ခံနေရတာကို ကမ္ဘာက မြင်အောင် လုပ်ဖို့နဲ့ ဒီပဋိပက္ခဟာ
ပိုပြီးတော့ ရှုပ်ထွေးလာတယ်။ ပိုပြီးတော့ ရှုပ်ထွေးလာတာကို တွေ့နိုင်တယ်။ အထူးသဖြင့် ရခိုင်ပြည်နယ်မှာ ဒုက္ခသည်တွေရဲ့ ကျန်းမာရေး စောင့်ရှောက်မှုတွေကို ရခိုင်ပြည်နယ်ထဲက
မြို့နယ်တွေမှာ လုပ်ဆောင်တာဖြစ်သလို ဒုက္ခသည်စခန်းတွေ မှာ ဆေးကုသဖို့ လိုအပ်တဲ့ ဆေးဝါးတွေ လိုအပ်နေတယ်လို့ UNHCR က ဆိုပါတယ်။ ဒီအကြောင်းနဲ့ ပတ်သက်ပြီး
အပြည့်အစုံကိုတော့ ထိုင်းအခြေစိုက် ဗွီအိုအေသတင်းထောက် မအေးအေးမာက သတင်းပေးပို့ထားပါတယ်။ရခိုင်မြောက်ပိုင်း ဘူးသီးတောင်နဲ့ ရသေ့တောင်မြို့နယ်က စစ်ဘေးဒုက္ခသည်တွေ၊
ဒေသခံ ဒီအကြောင်း မဆုမွန် စုစည်းတင်ပြပေးထားပါတယ်။ရခိုင်မြောက်ပိုင်း မောင်တောနဲ့ ဘူးသီးတောင်မြို့နယ်က စစ်ဘေးဒုက္ခသည် IDP တွေ စားဝတ်နေရေးအခက်ခဲတွေ ကြုံနေရလို့
ဘင်္ဂလားဒေ့ရှ် အစိုးရက စားသောက်ကုန်တွေ၊ အဝတ်အထည်တွေနဲ့ စားနပ်ရိက္ခာတွေကို အကူအညီပေးနေပြီး ရခိုင်ပြည်နယ်တွင်းမှာ စစ်ဘေးရှောင်ဒုက္ခသည်တွေအတွက် စားနပ်ရိက္ခာ၊
ဆေးဝါးနဲ့ စားသောက်ကုန် အကူအညီတွေ အမြန်ဆုံး ကူညီပေးဖို့ အစိုးရကို တောင်းဆိုထားပါတယ်။ဘူးသီးတောင်မြို့နယ်၊ ထို့နောက် တပ်မတော်စစ်ကြောင်းများက မောင်တောမြို့နယ်
တောင်ပိုင်း၊ တဟီရကျေးရွာကို စီးနင်း တိုက်ခိုက်ခဲ့ကာ ဒေသခံပြည်သူများအား စစ်မေး မေးမြန်းရာတွင် မြန်မာ့တပ်မတော်က မိုင်း
```
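The sentence-window step inside `generate_story` can be looked at on its own. The sketch below isolates it with plain strings in place of model output; the helper names `last_sentence_window` and `fresh_segments` and the placeholder text are hypothetical and are not part of the model's API.

```python
SENTENCE_END = "။"  # Burmese sentence-final mark used as the split point


def last_sentence_window(text, num_sentences=1):
    """Return the trailing `num_sentences` segments of `text`, split on "။"."""
    segments = text.split(SENTENCE_END)
    if len(segments) <= num_sentences:
        # Too few segments to cut a window: feed the whole text back.
        return text
    return SENTENCE_END.join(segments[-num_sentences:])


def fresh_segments(generated, num_sentences=1):
    """Drop the leading segments that correspond to the re-fed window."""
    return SENTENCE_END.join(generated.split(SENTENCE_END)[num_sentences:])


# Latin placeholders stand in for Burmese sentences.
story = "A။B။C"
window = last_sentence_window(story, num_sentences=1)
# Stand-in for what the model might return when prompted with `window`.
continued = window + " more။D။E"
new_text = fresh_segments(continued, num_sentences=1)
print(window)
print(new_text)
```

As in `generate_story`, only the segments after the re-fed window are kept, so any text that extends the window's own last segment (here `" more"`) is not carried into the final output.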

## Warning
This model was trained on data curated from the internet, and its output may contain bias, violence, explicit language, sexual content, and other harmful material. Please use it with care.

## Credit
I thank the original authors and contributors of the datasets listed above.
The technology exists, but datasets are needed to make the model work. The Transformer architecture has been available since 2017, yet training such models for Burmese remains challenging because few language resources are available on the internet. I hope this model will be a stepping stone toward more open models for the Myanmar language that benefit our community.
Anyone is welcome to contact me about dataset licensing or about contributing to the improvement of this model.