---
license: mit
language:
- ar
- bg
- bn
- ca
- cs
- da
- de
- el
- en
- es
- et
- eo
- fi
- fr
- he
- hr
- hu
- id
- it
- ja
- kk
- lt
- lv
- mk
- nl
- 'no'
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- tr
- uk
- vi
- zh
---
# PRISM Model for Multilingual Machine Translation
This repository contains the `Prism` model, a multilingual neural machine translation (NMT) system that covers 39 languages. Its authors use it as a zero-shot paraphraser to score machine translation output, an approach that does not require human judgments for training.
Because it is trained as a multilingual translation model, it can be used for translation between the supported language pairs as well as for translation evaluation and quality estimation, in both research and practical settings.
It was introduced in the paper [Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing](https://aclanthology.org/2020.emnlp-main.8.pdf) and first released in the [thompsonb/prism](https://github.com/thompsonb/prism/tree/master) repository.
## Model Description
The `Prism` model was designed to be a lexically/syntactically unbiased paraphraser. The core idea is to treat paraphrasing as a zero-shot translation task, which allows the model to cover a wide range of languages effectively.
### Performance as an Evaluation Metric
According to the paper, `Prism` used as an evaluation metric achieved competitive or superior results across various language pairs in the WMT 2019 shared metrics task. It outperformed existing evaluation metrics in many cases, in both high-resource and low-resource settings.
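To make the metric idea concrete, here is a minimal sketch of how per-token log-probabilities could be combined into a sentence-level score. It assumes the reference-based score is the average of the length-normalized log-probabilities of the system output given the reference and of the reference given the system output, which is how we read the paper. The function names and the numbers are illustrative only and are not part of this repository or of the original `prism` code.

```python
from typing import Sequence


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of a forced-decoded sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def prism_ref_score(
    sys_given_ref: Sequence[float],
    ref_given_sys: Sequence[float],
) -> float:
    """Symmetric reference-based score: mean of both scoring directions."""
    return 0.5 * (
        length_normalized_logprob(sys_given_ref)
        + length_normalized_logprob(ref_given_sys)
    )


# Hypothetical token log-probabilities, made up for illustration only.
sys_given_ref = [-0.2, -0.4, -0.6]
ref_given_sys = [-0.1, -0.3]
score = prism_ref_score(sys_given_ref, ref_given_sys)
print(round(score, 3))
```

Scores are log-probabilities, so they are negative and values closer to zero indicate that the model finds the two sentences more mutually predictable. In practice the token log-probabilities would come from force-decoding one sentence conditioned on the other with the model.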
## Installation
To use `PrismTokenizer`, ensure that the `sentencepiece` package is installed, as it is a required dependency for handling multilingual tokenization.
```bash
pip install sentencepiece
```
## Usage Example
```python
from transformers import PrismForConditionalGeneration, PrismTokenizer

uk_text = "Життя як коробка шоколаду"
ja_text = "人生はチョコレートの箱のようなもの。"

model = PrismForConditionalGeneration.from_pretrained("dariast/prism")
tokenizer = PrismTokenizer.from_pretrained("dariast/prism")

# Translate Ukrainian to French
tokenizer.src_lang = "uk"  # language of the input text
encoded_uk = tokenizer(uk_text, return_tensors="pt")
# forced_bos_token_id forces the first generated token to be the target-language token
generated_tokens = model.generate(**encoded_uk, forced_bos_token_id=tokenizer.get_lang_id("fr"), max_new_tokens=20)
# batch_decode returns a list with one string per input sentence
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
# => 'La vie comme une boîte de chocolat.'

# Translate Japanese to English
tokenizer.src_lang = "ja"
encoded_ja = tokenizer(ja_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_ja, forced_bos_token_id=tokenizer.get_lang_id("en"), max_new_tokens=20)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
# => 'Life is like a box of chocolate.'
```
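The steps in the example above can be wrapped in a small helper for translating several sentences at once. This is a sketch, not part of the released code: `translate` is a hypothetical name, and it assumes the tokenizer accepts a list of strings with `padding=True`, as Hugging Face tokenizers generally do. The model and tokenizer are passed in, so the function itself has no dependencies.

```python
from typing import Any, List


def translate(
    texts: List[str],
    src_lang: str,
    tgt_lang: str,
    model: Any,
    tokenizer: Any,
    max_new_tokens: int = 20,
) -> List[str]:
    """Translate a batch of sentences that share one source language."""
    tokenizer.src_lang = src_lang
    # padding=True pads every sentence to the longest one in the batch
    # (assumed to be supported by the tokenizer).
    encoded = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the `model`, `tokenizer`, and `uk_text` objects from the usage example, a call would look like `translate([uk_text], "uk", "fr", model, tokenizer)`. Longer inputs may need a larger `max_new_tokens`.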
## Languages Covered
Albanian (sq), Arabic (ar), Bengali (bn), Bulgarian (bg), Catalan/Valencian (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Esperanto (eo), Estonian (et), Finnish (fi), French (fr), German (de), Modern Greek (el), Modern Hebrew (he), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Latvian (lv), Lithuanian (lt), Macedonian (mk), Norwegian (no), Polish (pl), Portuguese (pt), Romanian/Moldovan (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovene (sl), Spanish/Castilian (es), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi).
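Since the language codes are passed around as plain strings, it can help to validate them before setting `tokenizer.src_lang` or calling `tokenizer.get_lang_id`. The sketch below hard-codes the 39 codes listed above. `SUPPORTED_LANGS` and `check_lang` are hypothetical names introduced for illustration, not part of the tokenizer API.

```python
# The 39 language codes listed in this model card.
SUPPORTED_LANGS = frozenset(
    "ar bg bn ca cs da de el en es et eo fi fr he hr hu id it ja "
    "kk lt lv mk nl no pl pt ro ru sk sl sq sr sv tr uk vi zh".split()
)


def check_lang(code: str) -> str:
    """Return the normalized code, or raise if the model does not list it."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported language code {code!r}; "
            f"expected one of: {', '.join(sorted(SUPPORTED_LANGS))}"
        )
    return normalized


print(check_lang("FR"))
```

Failing early with the full list of valid codes is usually easier to debug than an error raised from inside the tokenizer.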
## Citation
If you use this model in your research, please cite the original paper:
```
@inproceedings{thompson-post-2020-automatic,
    title = "Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing",
    author = "Brian Thompson and Matt Post",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}
```