Update README.md
README.md CHANGED

```diff
@@ -12,13 +12,27 @@ model-index:
 This model is a Dutch fine-tuned version of
 [facebook/bart-base](https://huggingface.co/facebook/bart-base).
 
-It achieves the following results on an external evaluation set
+It achieves the following results on an external evaluation set of human-corrected spelling
+errors of Dutch snippets of internet text (evalua)
 
-* CER - 0.
-* WER - 0.
-* BLEU - 0.
+* CER - 0.024
+* WER - 0.088
+* BLEU - 0.840
 * METEOR - 0.932
 
+Note that it is very hard for any spelling corrector to correct more actual spelling errors
+than it introduces new ones. In other words, most spelling correctors cannot be run
+automatically and must be used interactively.
+
+These are the upper-bound scores when correcting _nothing_, i.e. the actual distance
+between the errors and their corrections in the evaluation set:
+
+* CER - 0.010
+* WER - 0.053
+* BLEU - 0.900
+* METEOR - 0.954
+
+We are not there yet, clearly.
 
 ## Model description
 
@@ -38,12 +52,12 @@ checker.
 
 ## Training and evaluation data
 
-The model was trained on a Dutch dataset composed of
-of text from three public Dutch sources, downloaded from the
+The model was trained on a Dutch dataset composed of 12,351,203 lines
+of text, containing a total of 123,131,153 words, from three public Dutch sources, downloaded from the
 [Opus corpus](https://opus.nlpl.eu/):
 
 - nl-europarlv7.txt (2,387,000 lines)
+- nl-opensubtitles2016.9m.txt (9,000,000 lines)
 - nl-wikipedia.txt (964,203 lines)
 
```
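For reference, CER and WER as reported in the card are conventionally defined as Levenshtein edit distance normalized by the reference length, at the character and word level respectively. Here is a minimal sketch of those generic definitions; this is not necessarily the exact scorer used for the reported numbers, and the Dutch example sentences are made up:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate: character edits / reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

def wer(hypothesis, reference):
    """Word error rate: word edits / reference word count."""
    ref_words = reference.split()
    return levenshtein(hypothesis.split(), ref_words) / len(ref_words)

# The "correct nothing" baseline scores the uncorrected input itself
# against the human reference (hypothetical example pair):
noisy = "Dit is een tekst met een speling fout"
clean = "Dit is een tekst met een spelling fout"
print(round(cer(noisy, clean), 3), round(wer(noisy, clean), 3))  # 0.026 0.125
```

A corrector only improves on that baseline when its edits remove more distance to the reference than its mistaken edits add, which is why the card's scores are compared against the correct-nothing upper bound.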
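As a quick sanity check, the per-source line counts listed in the training-data section do add up to the stated dataset total:

```python
# Per-source line counts as listed in the README.
sources = {
    "nl-europarlv7.txt": 2_387_000,
    "nl-opensubtitles2016.9m.txt": 9_000_000,
    "nl-wikipedia.txt": 964_203,
}
total_lines = sum(sources.values())
print(total_lines)  # 12351203, matching the stated 12,351,203-line total
```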