---
license: apache-2.0
datasets:
- opus_books
- iwslt2017
language:
- en
- nl
metrics:
- sacrebleu
pipeline_tag: text2text-generation
tags:
- translation
widget:
- text: ">>en<< Was het leuk?"
---

**NOTE:** This is a work-in-progress model and is **not** considered finished. Keep this in mind when using it or when continuing to train it.

# Model Card for mt5-small nl-en translation

The mt5-small nl-en translation model is a finetuned version of [google/mt5-small](https://huggingface.co./google/mt5-small).

It was fine-tuned on 237k rows of the [iwslt2017](https://huggingface.co./datasets/iwslt2017/viewer/iwslt2017-en-nl) dataset and roughly 38k rows of the [opus_books](https://huggingface.co./datasets/opus_books/viewer/en-nl) dataset. Training ran in multiple phases, each with its own number of epochs and batch size.


## How to use

**Install dependencies**
```bash
# torch is required because the example below requests PyTorch tensors (return_tensors="pt")
pip install transformers sentencepiece protobuf torch
```

You can use the following code for model inference. The model was fine-tuned with a target-language identifier (`>>en<<`) at the start of each input, so include it in the prompt for the best results.

```Python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Michielo/mt5-small_nl-en_translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Michielo/mt5-small_nl-en_translation")

# Beam-search settings used for translation.
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

# Optional round trip: save the config to disk and load it back for reuse.
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")

# Prefix the Dutch input with the ">>en<<" identifier.
inputs = tokenizer(">>en<< Your Dutch text here", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
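
To translate several sentences at once, the identifier can be added programmatically. The following is a minimal sketch, assuming `>>en<<` is the only identifier the model expects; `with_target_prefix` and `translate_batch` are hypothetical helper names and are not part of this repository.

```python
from typing import Iterable, List

TARGET_PREFIX = ">>en<<"  # identifier used in this model card's examples


def with_target_prefix(texts: Iterable[str], prefix: str = TARGET_PREFIX) -> List[str]:
    """Prepend the target-language identifier unless it is already present."""
    prompts = []
    for text in texts:
        stripped = text.strip()
        if stripped.startswith(prefix):
            prompts.append(stripped)
        else:
            prompts.append(f"{prefix} {stripped}")
    return prompts


def translate_batch(
    texts: Iterable[str],
    model_name: str = "Michielo/mt5-small_nl-en_translation",
) -> List[str]:
    """Translate Dutch sentences to English in one batched generate() call."""
    # Imported lazily so the prefix helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # padding=True pads shorter prompts so the batch forms a single tensor.
    inputs = tokenizer(with_target_prefix(texts), return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, num_beams=4, early_stopping=True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `translate_batch(["Was het leuk?", "Goedemorgen"])` returns one English string per input. The helper loads the model on every call, which keeps the sketch short; for repeated use, load the tokenizer and model once and reuse them.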


## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.