19th Century Dutch Spelling Normalization

This repository contains a version of google/ByT5-small that has been further pretrained and then finetuned for 19th-century Dutch spelling normalization. We first pretrained the model on 2 million sentences from Dutch historical novels. We then finetuned it on a dataset of 10k 19th-century Dutch sentences, which were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).

The finetuned model is only available in TensorFlow format, but it can be converted to PyTorch. The pretrained-only weights are available in PyTorch format in the directory Pretrained_ByT5; note that this model has to be finetuned before it can be used for normalization. The train and validation sets used for finetuning are available in the main repository. For further information about the model, please see the GitHub repository.
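
The conversion to PyTorch mentioned above could look roughly like the sketch below. It assumes a transformers release that still ships TensorFlow support and has both TensorFlow and PyTorch installed; the function name and the output directory are hypothetical and not part of this repository.

```python
def convert_tf_checkpoint_to_pytorch(repo_id: str, output_dir: str) -> None:
    """Load the TensorFlow weights into a PyTorch T5 model and save them locally."""
    # Imported here so that defining the function does not require the libraries.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    # from_tf=True asks transformers to read the TensorFlow checkpoint
    # instead of looking for PyTorch weights.
    model = T5ForConditionalGeneration.from_pretrained(repo_id, from_tf=True)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)

    # Write PyTorch weights and tokenizer files to the chosen directory.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


# Example call (downloads the model; the output directory name is hypothetical):
# convert_tf_checkpoint_to_pytorch(
#     "AWolters/ByT5_DutchSpellingNormalization",
#     "ByT5_DutchSpellingNormalization_pt",
# )
```

The saved directory can then be loaded with T5ForConditionalGeneration.from_pretrained without the from_tf flag.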

How to use:

from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

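# Decode the generated ids back to text; text_target is an argument of the
# tokenizer call, not of decode, so it is omitted here.
print(tokenizer.decode(prediction[0], skip_special_tokens=True))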
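
Because ByT5 works on UTF-8 bytes rather than subwords, max_new_tokens=100 in the call above limits the output to roughly 100 bytes, which may truncate longer sentences. A minimal sketch of deriving the limit from the input length instead is shown below; the helper names, the slack factor of 1.5, and the minimum of 32 are illustrative assumptions, not settings used by the authors.

```python
def byte_length(text: str) -> int:
    """Number of UTF-8 bytes in text, i.e. roughly the number of ByT5 input tokens."""
    return len(text.encode("utf-8"))


def generation_budget(text: str, slack: float = 1.5, minimum: int = 32) -> int:
    """Suggest a max_new_tokens value proportional to the input's byte length.

    slack and minimum are hypothetical defaults chosen for illustration.
    """
    return max(minimum, int(byte_length(text) * slack) + 1)


text = 'De menschen waren aan het werk.'
budget = generation_budget(text)
print(byte_length(text), budget)
```

The resulting value could be passed as max_new_tokens=budget to model.generate in place of the fixed 100.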

Setup:

The model has been finetuned with the following (hyper)parameter values:

Learning rate: 5e-5
Batch size: 32
Optimizer: AdamW
Epochs: 30, with early stopping

To further finetune the model, use the T5Trainer.py script.
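
To illustrate what "30 epochs with early stopping" means in practice, here is a minimal, framework-independent sketch. It is not the logic of T5Trainer.py: the function name, the patience of 3, and the example losses are assumptions made for illustration; only the values in the settings dictionary come from the list above.

```python
# Settings taken from the list above.
FINETUNE_SETTINGS = {
    "learning_rate": 5e-5,
    "batch_size": 32,
    "optimizer": "AdamW",
    "max_epochs": 30,
}


def run_with_early_stopping(val_losses, max_epochs=30, patience=3):
    """Return (epochs_run, best_loss) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` consecutive
    epochs, or after `max_epochs`. The patience value is a hypothetical choice.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_loss


# Made-up validation losses: improvement stalls after the third epoch.
example_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(run_with_early_stopping(example_losses, FINETUNE_SETTINGS["max_epochs"]))
```

With a stopping rule like this, the epoch count acts as an upper bound rather than a fixed training length.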
