banglat5_banglaparaphrase

This repository contains the BanglaT5 checkpoint finetuned on the BanglaParaphrase dataset. BanglaT5 is a sequence-to-sequence transformer model pretrained with the "Span Corruption" objective. Finetuned models using this checkpoint achieve competitive results on the dataset.

For finetuning and inference, refer to the scripts in the official GitHub repository of BanglaNLG.

Note: This model was pretrained using a specific normalization pipeline (https://github.com/csebuetnlp/normalizer). All finetuning scripts in the official GitHub repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing to get the best results. A basic example is given below:

Using this model in transformers

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize # pip install git+https://github.com/csebuetnlp/normalizer

model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_banglaparaphrase")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_banglaparaphrase", use_fast=False)

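input_sentence = ""  # put the Bangla sentence to paraphrase here
# Normalize before tokenizing, since the model was trained on normalized text
input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids
generated_tokens = model.generate(input_ids)
# Drop special tokens such as padding and end-of-sequence from the decoded text
decoded_tokens = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]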

print(decoded_tokens)
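
To paraphrase several sentences at once, the normalize, tokenize, generate, and decode steps can be wrapped in a small helper. The following is a minimal sketch, not part of the official scripts: the name `paraphrase_batch` and its `normalize_fn` parameter are hypothetical, and it assumes `model` and `tokenizer` are loaded as in the model card's example.

```python
from typing import Any, Callable, List


def paraphrase_batch(
    sentences: List[str],
    model: Any,
    tokenizer: Any,
    normalize_fn: Callable[[str], str],
    **generate_kwargs: Any,
) -> List[str]:
    """Hypothetical helper: normalize, tokenize, generate, and decode a batch."""
    # Apply the normalization pipeline to every sentence before tokenizing.
    normalized = [normalize_fn(sentence) for sentence in sentences]
    # Pad so that sentences of different lengths fit into one batch.
    batch = tokenizer(normalized, return_tensors="pt", padding=True)
    # Forward any extra generation options (e.g. num_beams) unchanged.
    output_ids = model.generate(**batch, **generate_kwargs)
    # Strip special tokens such as padding and end-of-sequence markers.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the objects from the model card, a call might look like `paraphrase_batch(sentences, model, tokenizer, normalize, num_beams=5, max_length=128)`; the generation settings here are illustrative values, not the ones used to produce the reported benchmarks.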

Benchmarks

  • Supervised fine-tuning
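| Test Set | Model | sacreBLEU | ROUGE-L | PINC | BERTScore | BERT-iBLEU |
|---|---|---|---|---|---|---|
| BanglaParaphrase | BanglaT5 | 32.8 | 63.58 | 74.40 | 94.80 | 92.18 |
| BanglaParaphrase | IndicBART | 5.60 | 35.61 | 80.26 | 91.50 | 91.16 |
| BanglaParaphrase | IndicBARTSS | 4.90 | 33.66 | 82.10 | 91.10 | 90.95 |
| IndicParaphrase | BanglaT5 | 11.0 | 19.99 | 74.50 | 94.80 | 87.738 |
| IndicParaphrase | IndicBART | 12.0 | 21.58 | 76.83 | 93.30 | 90.65 |
| IndicParaphrase | IndicBARTSS | 10.7 | 20.59 | 77.60 | 93.10 | 90.54 |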
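
Of the metrics reported here, PINC is the least widely known: it rewards paraphrases that introduce n-grams not present in the source sentence, so a higher score means more lexical novelty. The sketch below assumes the commonly used definition (one minus the fraction of candidate n-grams that also occur in the source, averaged over n = 1 to 4) with plain whitespace tokenization. The function names are hypothetical, the result is a fraction in [0, 1] rather than a 0 to 100 score, and the official evaluation scripts may tokenize or handle short sentences differently.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """Sketch of PINC: average share of candidate n-grams absent from the source."""
    src_tokens = source.split()
    cand_tokens = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand_tokens, n)
        if not cand_ngrams:
            # Candidate is shorter than n tokens; skip this order (an assumption).
            continue
        overlap = len(ngrams(src_tokens, n) & cand_ngrams)
        scores.append(1.0 - overlap / len(cand_ngrams))
    # An empty candidate yields no n-grams at all; return 0.0 in that case.
    return sum(scores) / len(scores) if scores else 0.0
```

An exact copy of the source scores 0.0 and a candidate sharing no words with it scores 1.0, which is why PINC is usually read alongside a semantic-similarity metric such as BERTScore.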

The dataset can be found in the link below:

Citation

If you use this model, please cite the following paper:

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}