
	---
	license: apache-2.0
	tags:
	- generated_from_trainer
	model-index:
	- name: bart-base-spelling-nl-1m-3
	  results: []
	language:
	- nl
	metrics:
	- bleu
	- cer
	- wer
	- meteor
	pipeline_tag: text2text-generation
	---

	# bart-base-spelling-nl

	This model is a version of
	[facebook/bart-base](https://huggingface.co/facebook/bart-base)
	fine-tuned for Dutch spelling correction.

	On an external evaluation set of human-corrected spelling errors in Dutch snippets of
	internet text ([errors](https://huggingface.co/antalvdb/bart-base-spelling-nl/blob/main/opentaal-annotaties.txt.errors)
	and [corrections](https://huggingface.co/antalvdb/bart-base-spelling-nl/blob/main/opentaal-annotaties.txt.corrections);
	run [spell.py](https://huggingface.co/antalvdb/bart-base-spelling-nl/blob/main/spell.py) to reproduce),
	it achieves the following results:

	* CER - 0.024
	* WER - 0.088
	* BLEU - 0.840
	* METEOR - 0.932

	Note that it is very hard for any spelling corrector to fix more actual spelling errors
	than it introduces. In other words, most spelling correctors cannot be run
	automatically and must be used interactively.

	For reference, these are the scores obtained when correcting _nothing_. In other words,
	they measure the actual distance between the errors and their corrections in the evaluation set:

	* CER - 0.010
	* WER - 0.053
	* BLEU - 0.900
	* METEOR - 0.954

	The model currently scores worse than this do-nothing baseline on all four metrics,
	so we are clearly not there yet.
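
The do-nothing comparison above can be illustrated with a minimal sketch of character and word error rates, computed here with a plain Levenshtein distance. This is not the evaluation code behind the reported scores (see `spell.py` for that); the function names `edit_distance` and `error_rate` and the example sentence are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Total edits divided by total reference length, over characters or words."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        total_edits += edit_distance(ref, hyp)
        total_length += len(ref)
    return total_edits / total_length


# Hypothetical one-line evaluation set: the uncorrected text is scored
# directly against its human correction, i.e. "correcting nothing".
corrections = ["ik word morgen dertig"]
errors = ["ik wordt morgen dertig"]

baseline_cer = error_rate(corrections, errors, unit="char")
baseline_wer = error_rate(corrections, errors, unit="word")
```

To score a corrector instead of the baseline, pass its output lines as `hypotheses` in place of `errors`.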

	## Model description

	This is a fine-tuned version of
	[facebook/bart-base](https://huggingface.co/facebook/bart-base)
	trained on spelling correction. It builds on the excellent work by
	Oliver Guhr ([github](https://github.com/oliverguhr/spelling),
	[huggingface](https://huggingface.co/oliverguhr/spelling-correction-english-base)). Training
	was performed on a single GPU on an AWS EC2 instance (g5.xlarge) and
	took about two days.

	## Intended uses & limitations

	The intended use for this model is to be a component of the
	[Valkuil.net](https://valkuil.net) context-sensitive spelling
	checker.
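
Outside of Valkuil.net, the model can presumably be tried through the `transformers` text2text-generation pipeline, matching the `pipeline_tag` in the metadata. The sketch below is one possible way to do so; the helper names `build_corrector` and `correct`, the `max_length` value, and the example sentence are assumptions, not settings taken from this model card.

```python
def build_corrector(model_id="antalvdb/bart-base-spelling-nl"):
    """Load the model from the Hugging Face Hub as a text2text-generation pipeline."""
    from transformers import pipeline  # imported lazily; triggers a model download
    return pipeline("text2text-generation", model=model_id)


def correct(lines, corrector, max_length=128):
    """Run the corrector on a list of lines and return the generated texts.

    max_length=128 is an assumed value chosen for short snippets.
    """
    outputs = corrector(lines, max_length=max_length)
    return [output["generated_text"] for output in outputs]


if __name__ == "__main__":
    corrector = build_corrector()
    for line in correct(["Ik wordt morgen dertig."], corrector):
        print(line)
```

Because the model can introduce new errors, its output is best shown to the user as a suggestion rather than applied automatically.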

	## Training and evaluation data

	The model was trained on a Dutch dataset of 12,351,203 lines of text,
	containing 123,131,153 words in total, drawn from three public Dutch
	sources downloaded from the
	[OPUS corpus](https://opus.nlpl.eu/):

	- nl-europarlv7.txt (2,387,000 lines)
	- nl-opensubtitles2016.9m.txt (9,000,000 lines)
	- nl-wikipedia.txt (964,203 lines)
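
As a quick sanity check on the figures above, the following sketch sums the per-source line counts and derives each source's share and the average line length. It uses only numbers stated in this section; the variable names are illustrative.

```python
# Line counts per source, as listed above.
line_counts = {
    "nl-europarlv7.txt": 2_387_000,
    "nl-opensubtitles2016.9m.txt": 9_000_000,
    "nl-wikipedia.txt": 964_203,
}
total_words = 123_131_153  # total word count stated above

total_lines = sum(line_counts.values())
shares = {name: count / total_lines for name, count in line_counts.items()}
avg_words_per_line = total_words / total_lines

for name, share in shares.items():
    print(f"{name}: {share:.1%} of lines")
print(f"total lines: {total_lines:,}")
print(f"average words per line: {avg_words_per_line:.2f}")
```

The subtitle data accounts for roughly 73% of all lines, and lines average about ten words, so the training text is dominated by short, conversational sentences.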


	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 2
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 16
	- total_train_batch_size: 32
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 2.0
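
The list above could be expressed as `transformers` training arguments roughly as follows. This is a sketch, not the original training script: the use of `Seq2SeqTrainingArguments`, the `output_dir` value, and the helper name `build_training_arguments` are assumptions. The total train batch size of 32 follows from the per-device batch size times the gradient accumulation steps on a single GPU.

```python
# Hyperparameters from the list above, mapped to TrainingArguments field names.
hyperparameters = {
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 4,
    "seed": 42,
    "gradient_accumulation_steps": 16,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2.0,
}

num_devices = 1  # training ran on a single GPU
effective_batch_size = (
    hyperparameters["per_device_train_batch_size"]
    * hyperparameters["gradient_accumulation_steps"]
    * num_devices
)


def build_training_arguments(output_dir="bart-base-spelling-nl"):
    """Build training arguments; output_dir is a hypothetical path."""
    from transformers import Seq2SeqTrainingArguments  # imported lazily
    return Seq2SeqTrainingArguments(output_dir=output_dir, **hyperparameters)
```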


	### Framework versions

	- Transformers 4.27.3
	- PyTorch 2.0.0+cu117
	- Datasets 2.10.1
	- Tokenizers 0.13.2