---
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: bart-base-spelling-nl-1m-3
results: []
language:
- nl
metrics:
- bleu
- cer
- wer
- meteor
pipeline_tag: text2text-generation
---
# bart-base-spelling-nl
This model is a version of
[facebook/bart-base](https://huggingface.co./facebook/bart-base)
fine-tuned for Dutch spelling correction.
It achieves the following results on an external evaluation set of human-corrected spelling
errors in Dutch snippets of internet text ([errors](https://huggingface.co./antalvdb/bart-base-spelling-nl/blob/main/opentaal-annotaties.txt.errors)
and [corrections](https://huggingface.co./antalvdb/bart-base-spelling-nl/blob/main/opentaal-annotaties.txt.corrections),
run [spell.py](https://huggingface.co./antalvdb/bart-base-spelling-nl/blob/main/spell.py)):
* CER - 0.024
* WER - 0.088
* BLEU - 0.840
* METEOR - 0.932
Note that it is very hard for any spelling corrector to fix more actual spelling errors
than it introduces. In other words, most spelling correctors cannot be run
automatically and must be used interactively.
These are the baseline scores obtained when correcting _nothing_, that is,
the actual distance between the errors and their corrections in the evaluation set:
* CER - 0.010
* WER - 0.053
* BLEU - 0.900
* METEOR - 0.954
We are not there yet, clearly: on all four metrics the model still scores worse than leaving the text untouched.
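To make the CER and WER figures above concrete, here is a minimal sketch of how such rates can be computed with a plain edit distance. It is an illustration only: `spell.py` may compute its scores differently, and the function names and the toy sentence pair are assumptions, not taken from the evaluation set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Total edit distance divided by total reference length."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


def cer(references, hypotheses):
    # Character error rate: units are single characters (spaces included).
    return error_rate(references, hypotheses, list)


def wer(references, hypotheses):
    # Word error rate: units are whitespace-separated tokens.
    return error_rate(references, hypotheses, str.split)


# Toy example (invented): scoring the uncorrected text against its correction
# gives the "correcting nothing" baseline for this one sentence.
corrections = ["dat vind ik goed"]
uncorrected = ["dat vindt ik goed"]
print(cer(corrections, uncorrected), wer(corrections, uncorrected))
```

Scoring the uncorrected file against the corrections gives the baseline; scoring the model output against the same corrections gives the model's figures.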
## Model description
This is a fine-tuned version of
[facebook/bart-base](https://huggingface.co./facebook/bart-base)
trained on spelling correction. It leans on the excellent work by
Oliver Guhr ([github](https://github.com/oliverguhr/spelling),
[huggingface](https://huggingface.co./oliverguhr/spelling-correction-english-base)). Training
was performed on an AWS EC2 instance (g5.xlarge) on a single GPU, and
took about two days.
## Intended uses & limitations
The intended use for this model is to be a component of the
[Valkuil.net](https://valkuil.net) context-sensitive spelling
checker.
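For trying the model outside Valkuil.net, the following is a minimal sketch using the generic sequence-to-sequence classes in Transformers. It assumes the checkpoint is published under the repository id that appears in the links above, and the helper names and the `max_length` default are illustrative choices rather than settings from `spell.py`.

```python
from functools import lru_cache

# Assumption: repository id taken from the links in this model card.
MODEL_ID = "antalvdb/bart-base-spelling-nl"


@lru_cache(maxsize=1)
def load_model():
    """Load tokenizer and model once; imported lazily so this file loads without Transformers."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    return tokenizer, model


def correct_spelling(text, max_length=128):
    """Return the model's proposed correction for one Dutch snippet.

    max_length is an illustrative default, not a value from the original setup.
    """
    tokenizer, model = load_model()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `correct_spelling("...")` on a short Dutch sentence downloads the checkpoint on first use. Given the scores reported above, the output is best treated as a suggestion to review rather than an automatic replacement.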
## Training and evaluation data
The model was trained on a Dutch dataset composed of 12,351,203 lines
of text, containing a total of 123,131,153 words, from three public Dutch sources, downloaded from the
[Opus corpus](https://opus.nlpl.eu/):
- nl-europarlv7.txt (2,387,000 lines)
- nl-opensubtitles2016.9m.txt (9,000,000 lines)
- nl-wikipedia.txt (964,203 lines)
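As a quick consistency check on the figures above, this short sketch sums the per-source line counts and derives each source's share and the average line length. All numbers are copied from this section; only the variable names are made up.

```python
# Line counts per source file, copied from the list above.
SOURCE_LINES = {
    "nl-europarlv7.txt": 2_387_000,
    "nl-opensubtitles2016.9m.txt": 9_000_000,
    "nl-wikipedia.txt": 964_203,
}
TOTAL_WORDS = 123_131_153  # total word count stated above

total_lines = sum(SOURCE_LINES.values())
shares = {name: count / total_lines for name, count in SOURCE_LINES.items()}
avg_words_per_line = TOTAL_WORDS / total_lines

print(f"total lines: {total_lines:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of lines")
print(f"average words per line: {avg_words_per_line:.2f}")
```

The counts add up to the stated 12,351,203 lines. Subtitles make up nearly three quarters of the lines, and lines average about ten words, so the training text is dominated by short, conversational sentences.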
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2.0
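The list above can be written as keyword arguments for `Seq2SeqTrainingArguments` in Transformers. This is a sketch under assumptions: the mapping of `train_batch_size` and `eval_batch_size` to per-device values is inferred from the single-GPU setup described earlier, and the actual training script is not shown in this card.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 2,   # assumed to match train_batch_size
    "per_device_eval_batch_size": 4,    # assumed to match eval_batch_size
    "seed": 42,
    "gradient_accumulation_steps": 16,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2.0,
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples consumed per optimizer step: per-device batch x accumulation x devices."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)


# Single GPU, as stated in the model description.
print(effective_batch_size(TRAINING_KWARGS))

# With Transformers installed, the same values could be passed on, e.g.:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(output_dir="out", **TRAINING_KWARGS)  # "out" is a placeholder
```

The product of the per-device batch size (2) and the accumulation steps (16) on one GPU reproduces the listed `total_train_batch_size` of 32.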
### Framework versions
- Transformers 4.27.3
- Pytorch 2.0.0+cu117
- Datasets 2.10.1
- Tokenizers 0.13.2