Update README.md
README.md CHANGED

@@ -36,7 +36,7 @@ The training dataset consists of:
 - The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
 - Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)
 
-These are obtained from the [OPUS](https://opus.nlpl.eu/) base (Tiedemann, 2012) and filtered using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter); see [`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience, different runs tend to give extremely similar results. Do not hesitate to reach out if you run into difficulties using this setup to collect the data.
+These are obtained from the [OPUS](https://opus.nlpl.eu/) base (Tiedemann, 2012) and filtered using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see [`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience, different runs tend to give extremely similar results. Do not hesitate to reach out if you run into difficulties using this setup to collect the data.
 
 ## Training procedure
 
@@ -82,5 +82,6 @@ The following hyperparameters were used during training:
 
 ## References
 
 - Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, et al. 2022. "A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation". In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3053–70. Seattle, United States: Association for Computational Linguistics. <https://doi.org/10.18653/v1/2022.naacl-main.223>.
--
--
+- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. "OpusFilter: A Configurable Parallel Corpus Filtering Toolbox". In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational Linguistics.
+- Tiedemann, Jörg. 2012. "Parallel Data, Tools and Interfaces in OPUS". In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).
+- Tyers, Francis M. 2009. "Rule-based Augmentation of Training Data in Breton-French Statistical Machine Translation". In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT09), 213–218. Barcelona, Spain.
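The data collection described in the changed paragraph is driven entirely by an OpusFilter configuration file. As a rough illustration of what such a config looks like (this is a hypothetical sketch, not the contents of [`dl_opus.yaml`](dl_opus.yaml) — the corpus names, language codes, output paths, and filter thresholds here are assumptions), one might write:

```yaml
# Hypothetical OpusFilter config sketch; the repo's dl_opus.yaml is authoritative.
common:
  output_directory: data

steps:
  # Download a parallel corpus release from OPUS.
  - type: opus_read
    parameters:
      corpus_name: Tatoeba
      release: v2022-03-03
      source_language: br
      target_language: fr
      preprocessing: raw
      src_output: tatoeba.br
      tgt_output: tatoeba.fr

  # Drop segment pairs whose length ratio suggests misalignment.
  - type: filter
    parameters:
      inputs: [tatoeba.br, tatoeba.fr]
      outputs: [tatoeba.filtered.br, tatoeba.filtered.fr]
      filters:
        - LengthRatioFilter:
            threshold: 3
```

OpusFilter executes the steps in order when invoked as `opusfilter <config>.yaml`. Note that steps which retrain a statistical alignment model are the source of the slight non-determinism mentioned above.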