19th Century Dutch Spelling Normalization
This repository contains a version of google/ByT5-small that has been further pretrained and then finetuned for 19th-century Dutch spelling normalization. We first pretrained the model on 2 million sentences from Dutch historical novels. We then finetuned it on a dataset of 10k 19th-century Dutch sentences, which were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).
The finetuned model is only available in TensorFlow format, but it can be converted to PyTorch. The pretrained-only weights are available in PyTorch format in the directory Pretrained_ByT5; note that this model has to be finetuned before it can be used. The train and validation sets used for finetuning are available in the main repository. For further information about the model, please see the GitHub repository.
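The conversion from TensorFlow to PyTorch can be sketched as follows. This is a minimal sketch, not the authors' procedure: it assumes that both TensorFlow and PyTorch are installed and that the installed transformers version still supports the `from_tf=True` option of `from_pretrained`. The helper names `load_as_pytorch` and `convert_and_save` are hypothetical.

```python
REPO_ID = "AWolters/ByT5_DutchSpellingNormalization"


def load_as_pytorch(repo_id: str = REPO_ID):
    """Load the TensorFlow checkpoint into a PyTorch T5 model (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the libraries.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # from_tf=True asks transformers to convert the TensorFlow weights while loading.
    model = T5ForConditionalGeneration.from_pretrained(repo_id, from_tf=True)
    return tokenizer, model


def convert_and_save(output_dir: str, repo_id: str = REPO_ID) -> None:
    """Write a PyTorch copy of the model and tokenizer to output_dir (hypothetical helper)."""
    tokenizer, model = load_as_pytorch(repo_id)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

After `convert_and_save('some_local_dir')` has run once, the saved directory should be loadable with `T5ForConditionalGeneration.from_pretrained('some_local_dir')` without TensorFlow installed.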
How to use:
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

# Load the byte-level tokenizer and the finetuned TensorFlow model.
tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Example sentence in 19th-century spelling.
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

# Generate the normalized sentence.
prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

# Decode the generated ids back to text.
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
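Because ByT5 works on UTF-8 bytes rather than subwords, `max_new_tokens=100` caps the output at roughly 100 bytes, so longer sentences may come out truncated. The sketch below shows one way to derive the limit from the input length instead. The helper names, the 25% slack, and the fixed margin of 8 are illustrative assumptions, not values from the original model card.

```python
def byte_length(text: str) -> int:
    """Number of UTF-8 bytes in text; ByT5 uses one token per byte."""
    return len(text.encode("utf-8"))


def generation_budget(text: str, extra: int = 8) -> int:
    """Hypothetical helper: input length in bytes plus 25% slack and a fixed margin."""
    n = byte_length(text)
    return n + n // 4 + extra


# Non-ASCII characters take more than one byte, and therefore more than one token.
word = "re\u00eble"  # 5 characters, 6 bytes ('ë' is 2 bytes in UTF-8)
print(len(word), byte_length(word))

sentence = "De menschen waren aan het werk."
print(byte_length(sentence), generation_budget(sentence))
```

The result can then be passed to generation as `max_new_tokens=generation_budget(text)` in place of the fixed value of 100.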
Setup:
The model has been finetuned with the following (hyper)parameter values:
Learning rate: 5e-5
Batch size: 32
Optimizer: AdamW
Epochs: 30, with early stopping
To further finetune the model, use the T5Trainer.py script.
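To illustrate how the 30-epoch limit and early stopping interact, here is a minimal, framework-free sketch. It is not taken from T5Trainer.py: the monitored quantity (validation loss), the patience of 3, and the function name `stopping_epoch` are assumptions made for the example. Only the learning rate, batch size, and maximum number of epochs come from the settings listed above.

```python
# Values from the setup listed above.
LEARNING_RATE = 5e-5
BATCH_SIZE = 32
MAX_EPOCHS = 30


def stopping_epoch(val_losses, max_epochs=MAX_EPOCHS, patience=3):
    """Return the 1-based epoch after which training would stop.

    Training stops early once the validation loss has not improved for
    `patience` consecutive epochs (patience is an assumed value).
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # No early stop: training ran for all available epochs, capped at max_epochs.
    return min(len(val_losses), max_epochs)


# The loss stops improving after epoch 3, so training halts after epoch 6.
print(stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]))
```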