antalvdb committed
Commit 38f7e93 · 1 Parent(s): 3986880

Update README.md

Files changed (1)
  1. README.md +21 -7
README.md CHANGED

@@ -12,13 +12,27 @@ model-index:
 This model is a Dutch fine-tuned version of
 [facebook/bart-base](https://huggingface.co/facebook/bart-base).
 
-It achieves the following results on an external evaluation set:
+It achieves the following results on an external evaluation set of human-corrected spelling
+errors in Dutch snippets of internet text:
 
-* CER - 0.025
-* WER - 0.090
-* BLEU - 0.837
+* CER - 0.024
+* WER - 0.088
+* BLEU - 0.840
 * METEOR - 0.932
 
+Note that it is very hard for any spelling corrector to correct more actual spelling errors
+than it introduces new ones. In other words, most spelling correctors cannot be run
+automatically and must be used interactively.
+
+These are the upper-bound scores when correcting _nothing_. In other words, this is
+the actual distance between the errors and their corrections in the evaluation set:
+
+* CER - 0.010
+* WER - 0.053
+* BLEU - 0.900
+* METEOR - 0.954
+
+We are not there yet, clearly.
 
 ## Model description
 
@@ -38,12 +52,12 @@ checker.
 
 ## Training and evaluation data
 
-The model was trained on a Dutch dataset composed of 6,351,203 lines
-of text from three public Dutch sources, downloaded from the
+The model was trained on a Dutch dataset composed of 12,351,203 lines
+of text, containing a total of 123,131,153 words, from three public Dutch sources, downloaded from the
 [Opus corpus](https://opus.nlpl.eu/):
 
 - nl-europarlv7.txt (2,387,000 lines)
-- nl-opensubtitles2016.3m.txt (3,000,000 lines)
+- nl-opensubtitles2016.9m.txt (9,000,000 lines)
 - nl-wikipedia.txt (964,203 lines)
 
 
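
The CER / WER / BLEU / METEOR figures above, including the do-nothing baseline, follow standard metric definitions and could in principle be reproduced with the Hugging Face `evaluate` package. The sketch below is illustrative only: the list names (`corrupted`, `corrected`, `predicted`) and the example sentences are placeholders, not the card's actual evaluation data or evaluation script.

```python
# Minimal sketch of how scores like those in the card could be computed with the
# Hugging Face `evaluate` package. The data below is made up for illustration;
# the card's actual evaluation set and script are not part of this commit.
import evaluate

# Hypothetical triples: erroneous input, human correction, model output.
corrupted = ["Dit is een zin met een speling fout."]
corrected = ["Dit is een zin met een spelling fout."]
predicted = ["Dit is een zin met een spelling fout."]

cer = evaluate.load("cer")        # character error rate (lower is better)
wer = evaluate.load("wer")        # word error rate (lower is better)
bleu = evaluate.load("bleu")      # BLEU on a 0-1 scale (higher is better)
meteor = evaluate.load("meteor")  # METEOR (higher is better)


def score(predictions, references):
    """Compute the four metrics reported in the model card."""
    return {
        "CER": cer.compute(predictions=predictions, references=references),
        "WER": wer.compute(predictions=predictions, references=references),
        "BLEU": bleu.compute(predictions=predictions,
                             references=[[r] for r in references])["bleu"],
        "METEOR": meteor.compute(predictions=predictions,
                                 references=references)["meteor"],
    }


# Model output scored against the human corrections.
print("model output:      ", score(predicted, corrected))

# The "correcting nothing" baseline from the card: score the uncorrected
# input itself against the human corrections.
print("correcting nothing:", score(corrupted, corrected))
```

With this setup, the card's "correcting nothing" row is simply `score(corrupted, corrected)`; if a model's `score(predicted, corrected)` does not beat it, the corrector introduces more damage than it repairs, which is the point the card makes about interactive use.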