Model Card for tibetan-phonetic-transliteration
This model is a text2text generation model for phonetic transliteration of Tibetan script.
Model Details
Model Description
- Developed by: billingsmoore
- Model type: text2text generation
- Language(s) (NLP): Tibetan
- License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Finetuned from model: 'google-t5/t5-small'
Model Sources
- Repository: https://github.com/billingsmoore/MLotsawa
Uses
The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.
Direct Use
To use the model for transliteration in a Python script, you can use the transformers library like so:
from transformers import pipeline
transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')
# Any string of Unicode Tibetan script; this one is only an illustrative input.
tibetan_text = 'བཀྲ་ཤིས་བདེ་ལེགས།'
transliterated_text = transliterator(tibetan_text)
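The transformers `translation` pipeline returns a list with one dictionary per input, and the generated string is stored under the `translation_text` key. The sketch below shows one way to get plain strings back. The helper names `extract_texts` and `transliterate` are hypothetical and are not part of the model or the library.

```python
def extract_texts(pipeline_output):
    """Return the generated strings from a translation pipeline result."""
    return [item["translation_text"] for item in pipeline_output]


def transliterate(lines, model_name="billingsmoore/tibetan-phonetic-transliteration"):
    """Load the model (downloaded on first use) and transliterate a list of strings."""
    # Imported here so that extract_texts has no third-party dependency.
    from transformers import pipeline

    transliterator = pipeline("translation", model=model_name)
    return extract_texts(transliterator(lines))


# Example call (requires transformers and network access on first run):
# print(transliterate(["བཀྲ་ཤིས་བདེ་ལེགས།"]))
```

Passing a list of lines in one call lets the pipeline handle them together rather than one call per line.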
Downstream Use
The model can be finetuned for a specific use case using the following code.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor
from accelerate import Accelerator
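# Replace the placeholder with the name or path of your own dataset
dataset = load_dataset("<your dataset>")
# Hold out 10% of the training split for evaluation
dataset = dataset['train'].train_test_split(test_size=0.1)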
checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
# Pass the model object (not the checkpoint name) so the collator can prepare decoder inputs
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
source_lang = 'bo'
target_lang = 'phon'
def preprocess_function(examples):
    inputs = [example for example in examples[source_lang]]
    targets = [example for example in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")
    return model_inputs
tokenized_dataset = dataset.map(preprocess_function, batched=True)
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)
accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)
training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator
)
trainer.train()
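The script above reads its inputs from a column named 'bo' and its targets from a column named 'phon'. As a minimal sketch of one way to prepare such data, the snippet below writes pairs to a JSON Lines file using only the standard library. The helper name `write_pairs_jsonl`, the file name, and the example pair are all hypothetical, and the phonetic string is illustrative rather than a sample of the model's output.

```python
import json
import os
import tempfile


def write_pairs_jsonl(pairs, path):
    """Write (tibetan, phonetic) pairs as JSON Lines with 'bo' and 'phon' keys."""
    with open(path, "w", encoding="utf-8") as f:
        for tibetan, phonetic in pairs:
            # ensure_ascii=False keeps the Tibetan script readable in the file
            f.write(json.dumps({"bo": tibetan, "phon": phonetic}, ensure_ascii=False) + "\n")


# Illustrative pair only; replace with your own data.
example_pairs = [("བཀྲ་ཤིས་བདེ་ལེགས།", "tashi delek")]
jsonl_path = os.path.join(tempfile.gettempdir(), "my_pairs.jsonl")
write_pairs_jsonl(example_pairs, jsonl_path)

# The file can then be loaded with the datasets library, for example:
# dataset = load_dataset("json", data_files=jsonl_path)
```

Loading a single JSON Lines file this way yields a 'train' split, which matches the `dataset['train']` access in the script above.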
Bias, Risks, and Limitations
This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan. It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.
Recommendations
For users who wish to use the model for other texts, I recommend further finetuning on your own dataset using the instructions above.
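One way to judge whether the model performs satisfactorily on your own texts, before or after finetuning, is to compare its output with reference transliterations. The sketch below computes a character error rate from Levenshtein edit distance. This metric is an assumption chosen for illustration, not the evaluation used by the author, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by the total length of the references."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

A value of 0.0 means every prediction matches its reference exactly; higher values mean more character-level corrections would be needed.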
Training Details
This model was trained on 98,597 pairs of text. The first member of each pair is a line of Unicode Tibetan text, and the second (the target) is the phonetic transliteration of the first. This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced. The dataset and further information are available on both Kaggle and Hugging Face.
This model was trained for five epochs. Further information regarding training can be found in the documentation of the MLotsawa repository.
Model Card Contact
billingsmoore [at] gmail [dot] com