metadata

datasets:
  - allenai/c4
  - legacy-datasets/mc4
language:
  - pt
pipeline_tag: text2text-generation
base_model: google-t5/t5-small

ptt5-v2-small

Introduction

ptt5-v2 models are T5 models pretrained for the Portuguese language, obtained by continuing pretraining from Google's original checkpoints, in sizes from t5-small to t5-3B. These checkpoints were also used to train MonoT5 rerankers for Portuguese, which are available in their Hugging Face collection. For details on the pretraining process and the complete study, please refer to our paper, ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language.

Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the pretrained checkpoint from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("unicamp-dl/ptt5-v2-small")
model = T5ForConditionalGeneration.from_pretrained("unicamp-dl/ptt5-v2-small")
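
Because this is a pretrained checkpoint rather than a task-specific model, it would typically be fine-tuned before use on a downstream task. The following is a minimal sketch of computing the training loss for a single text-to-text pair. The helpers `format_pair` and `loss_for_pair`, the sentiment example, and the `sentimento: ` prefix are hypothetical and are not taken from the ptt5-v2 paper.

```python
MODEL_NAME = "unicamp-dl/ptt5-v2-small"


def format_pair(source: str, target: str, prefix: str = "") -> tuple:
    """Build an (input_text, label_text) pair.

    `prefix` is a hypothetical task prefix chosen for illustration.
    """
    return (f"{prefix}{source}".strip(), target.strip())


def loss_for_pair(source: str, target: str, model_name: str = MODEL_NAME) -> float:
    """Return the cross-entropy loss of the model on one (source, target) pair."""
    # Imported here so that format_pair can be used without transformers installed.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids

    # Passing `labels` makes the model compute the sequence-to-sequence loss.
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        labels=labels,
    )
    return outputs.loss.item()
```

Calling `loss_for_pair(*format_pair("O filme foi excelente.", "positivo", prefix="sentimento: "))` downloads the checkpoint on first use and returns a single loss value. A real fine-tuning run would wrap this forward pass in an optimizer loop over a full dataset.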

Citation

If you use our models, please cite:

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}