antalvdb committed
Commit 38f7e93 · 1 Parent(s): 3986880

Update README.md

Files changed (1)
  1. README.md +21 -7
README.md CHANGED

@@ -12,13 +12,27 @@ model-index:
 This model is a Dutch fine-tuned version of
 [facebook/bart-base](https://huggingface.co/facebook/bart-base).
 
-It achieves the following results on an external evaluation set:
+It achieves the following results on an external evaluation set of human-corrected spelling
+errors in Dutch snippets of internet text:
 
-* CER - 0.025
-* WER - 0.090
-* BLEU - 0.837
+* CER - 0.024
+* WER - 0.088
+* BLEU - 0.840
 * METEOR - 0.932
 
+Note that it is very hard for any spelling corrector to correct more actual spelling errors
+than it introduces new ones. In other words, most spelling correctors cannot be run
+automatically and must be used interactively.
+
+These are the upper-bound scores when correcting _nothing_. In other words, this is
+the actual distance between the errors and their corrections in the evaluation set:
+
+* CER - 0.010
+* WER - 0.053
+* BLEU - 0.900
+* METEOR - 0.954
+
+We are not there yet, clearly.
 
 ## Model description
 
@@ -38,12 +52,12 @@ checker.
 
 ## Training and evaluation data
 
-The model was trained on a Dutch dataset composed of 6,351,203 lines
-of text from three public Dutch sources, downloaded from the
+The model was trained on a Dutch dataset composed of 12,351,203 lines
+of text, containing a total of 123,131,153 words, from three public Dutch sources, downloaded from the
 [Opus corpus](https://opus.nlpl.eu/):
 
 - nl-europarlv7.txt (2,387,000 lines)
-- nl-opensubtitles2016.3m.txt (3,000,000 lines)
+- nl-opensubtitles2016.9m.txt (9,000,000 lines)
 - nl-wikipedia.txt (964,203 lines)
 
 
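
The CER / WER / BLEU / METEOR figures above, including the do-nothing baseline, follow standard metric definitions and could in principle be reproduced with the Hugging Face `evaluate` package. The sketch below is illustrative only: the list names (`corrupted`, `corrected`, `predicted`) and the example sentences are placeholders, not the card's actual evaluation data or evaluation script.

```python
# Minimal sketch of how scores like those in the card could be computed with the
# Hugging Face `evaluate` package. The data below is made up for illustration;
# the card's actual evaluation set and script are not part of this commit.
import evaluate

# Hypothetical triples: erroneous input, human correction, model output.
corrupted = ["Dit is een zin met een speling fout."]
corrected = ["Dit is een zin met een spelling fout."]
predicted = ["Dit is een zin met een spelling fout."]

cer = evaluate.load("cer")        # character error rate (lower is better)
wer = evaluate.load("wer")        # word error rate (lower is better)
bleu = evaluate.load("bleu")      # BLEU on a 0-1 scale (higher is better)
meteor = evaluate.load("meteor")  # METEOR (higher is better)


def score(predictions, references):
    """Compute the four metrics reported in the model card."""
    return {
        "CER": cer.compute(predictions=predictions, references=references),
        "WER": wer.compute(predictions=predictions, references=references),
        "BLEU": bleu.compute(predictions=predictions,
                             references=[[r] for r in references])["bleu"],
        "METEOR": meteor.compute(predictions=predictions,
                                 references=references)["meteor"],
    }


# Model output scored against the human corrections.
print("model output:      ", score(predicted, corrected))

# The "correcting nothing" baseline from the card: score the uncorrected
# input itself against the human corrections.
print("correcting nothing:", score(corrupted, corrected))
```

With this setup, the card's "correcting nothing" row is simply `score(corrupted, corrected)`; if a model's `score(predicted, corrected)` does not beat it, the corrector introduces more damage than it repairs, which is the point the card makes about interactive use.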