The Historical Irish BPE tokenizer was trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, CELT and the book subcorpus of the Historical Irish Corpus. The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because some texts contain code-switching, the vocabulary also includes some Latin.
Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, as in GPT-2 or RoBERTa. More advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most languages, or GPT, which uses spaCy and ftfy. The resulting words are then counted to obtain the frequency of each word in the training corpus.
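As a rough illustration of the pre-tokenization and counting step, the sketch below splits text on whitespace and counts word frequencies. The helper name count_words and the two-line toy corpus are hypothetical, and plain whitespace splitting is an assumption made for illustration; it is not necessarily the pre-tokenizer used for this tokenizer.

```python
from collections import Counter


def count_words(corpus):
    """Split each line on whitespace and count how often each word occurs."""
    counts = Counter()
    for line in corpus:
        counts.update(line.split())
    return counts


# Toy corpus (hypothetical), not the actual training data.
corpus = [
    "in n-ingin cucci",
    "in n-aidchi n-aili",
]

word_freqs = count_words(corpus)
print(word_freqs.most_common(1))  # [('in', 2)]
```

The resulting word-to-frequency mapping is the input that BPE works on in the next step.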
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size, which is a hyperparameter. This tokenizer was trained with vocab_size=25000 and min_frequency=2.
Use
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ancatmara/historical-irish-tokenizer-bpe")
texts = ['Boí Óengus in n-aidchi n-aili inna chotlud.', 'Co n-accae ní, in n-ingin cucci for crunn síuil dó.']
tokenizer(texts, max_length=128, truncation=True)
Out:
>>> {'input_ids': [[0, 15076, 4813, 290, 155, 256, 3122, 155, 256, 1025, 1747, 12091, 225, 2], [0, 2677, 155, 256, 991, 697, 427, 235, 290, 155, 256, 2057, 424, 4199, 419, 1013, 517, 729, 615, 600, 225, 2]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
tokenizer.decode([0, 15076, 4813, 290, 155, 256, 3122, 155, 256, 1025, 1747, 12091, 225, 2])
Out:
>>> '<s>Boí Óengus in n - aidchi n - aili inna chotlud. </s>'