---
language:
- br
- fr
license: mit
tags:
  - translation
model-index:
- name: m2m100_br_fr
  results: []
co2_eq_emissions:
  emissions: 3300
  source: "https://mlco2.github.io/impact"
  training_type: "fine-tuning"
  geographical_location: "Paris, France"
  hardware_used: "2 NVidia GeForce RTX 3090 GPUs"
---

Breton-French translator `m2m100_418M_br_fr`
============================================

This model is a fine-tuned version of
[facebook/m2m100_418M](https://huggingface.co./facebook/m2m100_418M) (Fan et al., 2021) on a
Breton-French parallel corpus. In order to obtain the best possible results, we use all of our
parallel data for training and consequently report no quantitative evaluation at this time.
Empirical qualitative evidence suggests that the translations are generally adequate for short and
simple examples; the behaviour of the model on long and/or complex inputs is currently unknown.

Try this model online in [Troer](https://huggingface.co./spaces/lgrobol/troer); feedback and
suggestions are welcome!

## Model description

See the description of the [base model](https://huggingface.co./facebook/m2m100_418M).

## Intended uses & limitations

This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale
many-to-many translation system on a medium-sized dataset of high-quality data. As it stands, and as
far as I can tell, it usually provides translations that are at least as good as those of other
available Breton-French translators, but it has not been evaluated quantitatively at a large scale.

## Training and evaluation data

The training dataset consists of:

- The [OfisPublik corpus v1](https://opus.nlpl.eu/OfisPublik-v1.php) (Tyers, 2009)
- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)

These are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered
using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see
[`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to
the retraining of a statistical alignment model, but in my experience, different runs tend to give
extremely similar results. Do not hesitate to reach out if you experience difficulties in using this
configuration to collect the data.
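
For illustration only, the snippet below shows the flavour of such filtering with a crude
length-ratio criterion in plain Python. It is **not** the actual pipeline, which is fully specified
in [`dl_opus.yaml`](dl_opus.yaml) and run with OpusFilter; the threshold and example pairs here are
made up.

```python
# Illustrative sketch only: a crude length-ratio filter in the spirit of the
# OpusFilter pipeline configured in dl_opus.yaml, not the actual pipeline.
def keep_pair(br_sentence: str, fr_sentence: str, max_ratio: float = 3.0) -> bool:
    """Discard pairs whose word-count ratio suggests a misalignment."""
    len_br = len(br_sentence.split())
    len_fr = len(fr_sentence.split())
    if min(len_br, len_fr) == 0:
        return False
    return max(len_br, len_fr) / min(len_br, len_fr) <= max_ratio


pairs = [
    ("Demat d'an holl !", "Bonjour à tous !"),  # kept: similar lengths
    ("Ya.", "Oui, bien sûr, je serai là demain matin de bonne heure."),  # dropped
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
```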

In addition to these, the training dataset also includes parallel br/fr sentences, provided as
glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
[ongoing port](https://github.com/Autogramm/Breton/commit/45ac2c444a979b7ee41e5f24a3bfd1ec39f09d7d)
to Universal Dependencies in the Autogramm project.

## Training procedure

The training hyperparameters are those suggested by Adelani et al. (2022) in their [code
release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine
translation of several African languages.

More specifically, we train this model with [zeldarose](https://github.com/LoicGrobol/zeldarose)
using the following command:

```bash
zeldarose transformer \
   --config train_config.toml \
   --tokenizer "facebook/m2m100_418M" --pretrained-model "facebook/m2m100_418M" \
   --out-dir m2m100_418M+br-fr --model-name m2m100_418M+br-fr \
   --strategy ddp --accelerator gpu --num-devices 4 --device-batch-size 2 --num-workers 8 \
   --max-epochs 16 --precision 16 --tf32-mode medium \
   --val-data {val_path}.jsonl \
   {train_path}.jsonl
```

### Training hyperparameters

The following hyperparameters were used during training:

```toml
[task]
change_ratio = 0.3
denoise_langs = []
poisson_lambda = 3.0
source_langs = ["br"]
target_langs = ["fr"]

[tuning]
batch_size = 16
betas = [0.9, 0.999]
epsilon = 1e-8
learning_rate = 5e-5
gradient_clipping = 1.0
lr_decay_steps = -1
warmup_steps = 1024
```

### Framework versions

- Transformers 4.26.1
- Pytorch 1.12.1
- Datasets 2.10.0
- Tokenizers 0.13.2
- Pytorch-lightning 1.9.3
- Zeldarose [c6456ead](https://github.com/LoicGrobol/spertiniite/commit/c6456ead3649c4e6ddfb4a5a74b40f344eded09f)

### Carbon emissions

At this time, we estimate emissions of roughly 300 gCO<sub>2</sub>eq per fine-tuning run. So far, we
account for

- fine-tuning the 3 released versions
- 8 development runs

that is, 11 runs in total. Therefore, the equivalent carbon emissions for this model so far are
approximately 11 × 300 = 3300 gCO<sub>2</sub>eq.

## References

- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter,
  et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African
  News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, 3053‑70. Seattle, United
  States: Association for Computational Linguistics.
  <https://doi.org/10.18653/v1/2022.naacl-main.223>.
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel
  Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for
  Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational
  Linguistics.
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep
  Baines, et al. 2021. “Beyond English-Centric Multilingual Machine Translation”. The Journal of
  Machine Learning Research 22 (1): 107:4839-107:4886.
- Jouitteau, Mélanie (ed.). 2009-2022. ARBRES, wikigrammaire des dialectes du breton et centre de
  ressources pour son étude linguistique formelle. IKER, CNRS. <http://arbres.iker.cnrs.fr>.
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th
  International Conference on Language Resources and Evaluation (LREC 2012).
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical
  machine translation”. In Proceedings of the 13th Annual Conference of the European Association
  for Machine Translation (EAMT09), 213-218. Barcelona, Spain.