---
datasets:
- unicamp-dl/mmarco
language:
- pt
pipeline_tag: text2text-generation
base_model: unicamp-dl/ptt5-v2-small
license: apache-2.0
---
## Introduction
MonoPTT5 models are T5 rerankers for the Portuguese language. Starting from [ptt5-v2 checkpoints](https://huggingface.co./collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0), they were trained for 100k steps on a mixture of Portuguese and English data from the mMARCO dataset.
For further information on the training and evaluation of these models, please refer to our paper, [ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language](https://arxiv.org/abs/2406.10806).
## Usage
The easiest way to use our models is through the `rerankers` package. After installing the package using `pip install rerankers[transformers]`, the following code can be used as a minimal working example:
```python
from rerankers import Reranker
import torch
query = "O futebol é uma paixão nacional"
docs = [
    "O futebol é superestimado e não deveria receber tanta atenção.",
    "O futebol é uma parte essencial da cultura brasileira e une as pessoas.",
]

ranker = Reranker(
    "unicamp-dl/monoptt5-small",
    inputs_template="Pergunta: {query} Documento: {text} Relevante:",
    dtype=torch.float32,  # or torch.bfloat16 if supported by your GPU
)

results = ranker.rank(query, docs)

print("Classification results:")
for result in results:
    print(result)
# Loading T5Ranker model unicamp-dl/monoptt5-small
# No device set
# Using device cuda
# Using dtype torch.float32
# Loading model unicamp-dl/monoptt5-small, this might take a while...
# Using device cuda.
# Using dtype torch.float32.
# T5 true token set to ▁Sim
# T5 false token set to ▁Não
# Returning normalised scores...
# Inputs template set to Pergunta: {query} Documento: {text} Relevante:
# Classification results:
# document=Document(text='O futebol é uma parte essencial da cultura brasileira e une as pessoas.', doc_id=1, metadata={}) score=0.9192759394645691 rank=1
# document=Document(text='O futebol é superestimado e não deveria receber tanta atenção.', doc_id=0, metadata={}) score=0.026855656877160072 rank=2
```
For additional configurations and more advanced usage, consult the `rerankers` [GitHub repository](https://github.com/AnswerDotAI/rerankers).
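If you prefer not to depend on `rerankers`, the same monoT5-style scoring can be reproduced with `transformers` directly. The sketch below is a minimal, unofficial example: it assumes the scoring scheme implied by the log output above (a softmax over the logits of the `▁Sim` and `▁Não` tokens at the first decoded position), not the package's exact implementation.
```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/monoptt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

query = "O futebol é uma paixão nacional"
doc = "O futebol é uma parte essencial da cultura brasileira e une as pessoas."

# Same input template used by the reranker above.
inputs = tokenizer(
    f"Pergunta: {query} Documento: {doc} Relevante:",
    return_tensors="pt",
    truncation=True,
)

with torch.no_grad():
    # Score only the first decoded token, starting from the decoder start token.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]

# Token names taken from the log output above; assumed, not guaranteed.
sim_id = tokenizer.convert_tokens_to_ids("▁Sim")
nao_id = tokenizer.convert_tokens_to_ids("▁Não")

# Normalised relevance score: probability of "Sim" relative to "Não".
score = torch.softmax(logits[[sim_id, nao_id]], dim=0)[0].item()
print(f"Relevance score: {score:.4f}")
```
Under these assumptions, the resulting score should roughly match the normalised score reported by `rerankers`, since both reduce to the same two-token softmax.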
## Citation
If you use our models, please cite:
```
@misc{piau2024ptt5v2,
  title={ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language},
  author={Marcos Piau and Roberto Lotufo and Rodrigo Nogueira},
  year={2024},
  eprint={2406.10806},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
``` |