lgrobol committed
Commit 3df5dc2
1 Parent(s): 1fa552f

Add initial README info

Files changed (1)

  1. README.md +39 -12

README.md CHANGED
@@ -10,27 +10,52 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
+ `m2m100_br_fr` Breton-French translator
+ =======================================

- # m2m100_br_fr
-
- This model is a fine-tuned version of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on an unknown dataset.
+ This model is a fine-tuned version of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on a Breton-French parallel corpus. In order to obtain the best possible results, we use all of our parallel data for training and consequently report no quantitative evaluation at this time. Empirical qualitative evidence suggests that the translations are generally adequate for short and simple examples; the behaviour of the model on long and/or complex inputs is currently unknown.

  ## Model description

- More information needed
+ See the description of the [base model](https://huggingface.co/facebook/m2m100_418M).
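+
+ For instance, translation from Breton to French works as with any other M2M100 checkpoint; a minimal sketch, assuming the checkpoint id `lgrobol/m2m100_br_fr` (adjust if the repository name differs):
+
+ ```python
+ from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
+
+ checkpoint_id = "lgrobol/m2m100_br_fr"  # assumed repository id for this model
+ tokenizer = M2M100Tokenizer.from_pretrained(checkpoint_id)
+ model = M2M100ForConditionalGeneration.from_pretrained(checkpoint_id)
+
+ # Breton is the source language; French is forced as the target language,
+ # mirroring the `--forced_bos_token fr` setting used at training time.
+ tokenizer.src_lang = "br"
+ inputs = tokenizer("Demat d'an holl!", return_tensors="pt")
+ generated = model.generate(
+     **inputs,
+     forced_bos_token_id=tokenizer.get_lang_id("fr"),
+     num_beams=5,
+ )
+ print(tokenizer.batch_decode(generated, skip_special_tokens=True))
+ ```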

  ## Intended uses & limitations

- More information needed
+ This is intended as a demonstration of the improvements brought by fine-tuning a large-scale many-to-many translation system on a medium-sized dataset of high-quality data. As it stands, and as far as I can tell, it usually provides translations that are at least as good as those of other available Breton-French translators, but this has not been evaluated quantitatively at a large scale.

  ## Training and evaluation data

- More information needed
+ The training dataset consists of:
+
+ - The [OfisPublik corpus v1](https://opus.nlpl.eu/OfisPublik-v1.php) (Tyers, 2009)
+ - The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
+ - Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)
+
+ These corpora are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter); see [`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience different runs tend to give extremely similar results.

  ## Training procedure

+ The training hyperparameters are those suggested by Adelani et al. (2022) in their [code release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine translation of several African languages.
+
+ More specifically, we use the [example training script](https://github.com/huggingface/transformers/blob/674f750a57431222fa2832503a108df3badf1564/examples/pytorch/translation/run_translation.py) that 🤗 Transformers provides for fine-tuning multilingual translation models such as mBART, with the following command:
+
+ ```bash
+ python run_translation.py \
+     --model_name_or_path facebook/m2m100_418M \
+     --do_train \
+     --train_file {path_to_train_corpus} \
+     --source_lang br \
+     --target_lang fr \
+     --output_dir {path_to_model} \
+     --per_device_train_batch_size=4 \
+     --per_device_eval_batch_size=4 \
+     --overwrite_output_dir \
+     --predict_with_generate \
+     --forced_bos_token fr \
+     --save_steps 50000 \
+     --num_beams 10
+ ```
+
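+ The file passed as `--train_file` uses the JSON Lines layout that `run_translation.py` accepts: one JSON object per line with a `translation` field mapping language codes to sentences. A minimal sketch of how such a file can be built from a pair of aligned plain-text files (the `train.br`/`train.fr`/`train.json` names are only illustrative):
+
+ ```python
+ import json
+
+ # Aligned sentence files: line i of the Breton file matches line i of the French file.
+ with open("train.br", encoding="utf-8") as br_file, open("train.fr", encoding="utf-8") as fr_file:
+     with open("train.json", "w", encoding="utf-8") as out_file:
+         for br_sentence, fr_sentence in zip(br_file, fr_file):
+             record = {"translation": {"br": br_sentence.strip(), "fr": fr_sentence.strip()}}
+             out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
+ ```
+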
  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -42,13 +67,15 @@ The following hyperparameters were used during training:
  - lr_scheduler_type: linear
  - num_epochs: 3.0

- ### Training results
-
-
-
  ### Framework versions

  - Transformers 4.23.1
  - Pytorch 1.12.1+cu116
  - Datasets 2.6.1
  - Tokenizers 0.13.1
+
+ ## References
+
+ - Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, et al. 2022. "A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation". In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3053–3070. Seattle, United States: Association for Computational Linguistics. <https://doi.org/10.18653/v1/2022.naacl-main.223>
+ - Tiedemann, Jörg. 2012. "Parallel Data, Tools and Interfaces in OPUS". In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).
+ - Tyers, Francis M. 2009. "Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation". In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT 2009), 213–218.