hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

  • English (British Library Corpus - Books)

More details can be found in our GitHub repository.

Pretraining

We use the official JAX/FLAX example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.

This model was trained with mean_noise_span_length=20 for one epoch.
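To give a feel for what this setting controls, here is a minimal sketch of how T5-style span corruption derives the number of masked spans from the mean noise span length. It follows the rounding used in the T5 span-corruption recipe as we understand it. The function name `noise_span_stats`, the noise density of 0.15 and the sequence length of 1024 are illustrative assumptions, not values taken from this model card. Because ByT5 operates on UTF-8 bytes, a span of 3 covers only about three characters, while a span of 20 typically covers several words.

```python
def noise_span_stats(seq_length, mean_noise_span_length, noise_density=0.15):
    """Return (num_noise_tokens, num_noise_spans) for one input sequence.

    noise_density=0.15 is an assumed value (commonly used for T5),
    not a setting confirmed by this model card.
    """
    # Number of tokens (bytes, for ByT5) that get masked in total.
    num_noise_tokens = int(round(seq_length * noise_density))
    # Keep at least one noise token and at least one non-noise token.
    num_noise_tokens = min(max(num_noise_tokens, 1), seq_length - 1)
    # The masked tokens are split into this many contiguous spans.
    num_noise_spans = max(int(round(num_noise_tokens / mean_noise_span_length)), 1)
    return num_noise_tokens, num_noise_spans


if __name__ == "__main__":
    # Hypothetical sequence length of 1024 bytes.
    print(noise_span_stats(1024, 3))   # -> (154, 51)
    print(noise_span_stats(1024, 20))  # -> (154, 8)
```

Under these assumptions, the same number of bytes is masked in both cases, but a length of 20 concentrates them into a few long spans instead of many short ones.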

Mean Noise Span Length

The previously pretrained hmByT5 models "accidentally" use a mean noise span length of 3, because this is the default value for T5. However, the ByT5 paper notes that a length of 3 would make the pretraining task too easy and recommends a value of 20. We therefore pretrained this model with mean_noise_span_length=20 and fine-tuned it on the English AjMC dataset:

Configuration                            Run 1   Run 2   Run 3   Run 4   Run 5   Avg.
wsFalse-bs4-e10-lr0.00015-poolingfirst   85.48   84.6    85.65   86.83   86.53   85.82 ± 0.79
wsFalse-bs4-e10-lr0.00016-poolingfirst   85.35   84.5    86.05   85.1    85.18   85.24 ± 0.5
wsFalse-bs8-e10-lr0.00016-poolingfirst   84.14   83.45   84.4    84.9    85.82   84.54 ± 0.79
wsFalse-bs8-e10-lr0.00015-poolingfirst   85.27   85.3    83.33   85.25   81.7    84.17 ± 1.45

For comparison, the model pretrained with a mean noise span length of 3 achieved 85.65 ± 1.21.
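The Avg. column in the table above can be recomputed from the five runs. The sketch below assumes the value after ± is the population standard deviation, which is what reproduces the reported numbers; the helper name `summarize` and the dictionary layout are illustrative and not part of the original evaluation code.

```python
from statistics import mean, pstdev

# Scores per configuration, copied from the table above (Run 1 to Run 5).
runs = {
    "wsFalse-bs4-e10-lr0.00015-poolingfirst": [85.48, 84.6, 85.65, 86.83, 86.53],
    "wsFalse-bs4-e10-lr0.00016-poolingfirst": [85.35, 84.5, 86.05, 85.1, 85.18],
    "wsFalse-bs8-e10-lr0.00016-poolingfirst": [84.14, 83.45, 84.4, 84.9, 85.82],
    "wsFalse-bs8-e10-lr0.00015-poolingfirst": [85.27, 85.3, 83.33, 85.25, 81.7],
}


def summarize(scores):
    """Return (mean, population standard deviation), rounded to two decimals."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


if __name__ == "__main__":
    for config, scores in runs.items():
        avg, std = summarize(scores)
        print(f"{config}: {avg} ± {std}")
```

With the sample standard deviation (`statistics.stdev`) the spread values would come out larger than those in the table, which is why the population variant is assumed here.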

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
