gaBERT
gaBERT is a BERT-base model trained on 7.9M Irish sentences. For more details, including the hyperparameters and pretraining corpora used, please refer to our paper.
How to use gaBERT with HuggingFace
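from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the gaBERT tokenizer and the model with its masked-language-modelling head
tokenizer = AutoTokenizer.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")
model = AutoModelForMaskedLM.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")

# Irish for roughly "Johnny Cash was a [MASK] musician."
sequence = f"Ceoltóir {tokenizer.mask_token} ab ea Johnny Cash."
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# Position of the mask token within the encoded sequence
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

with torch.no_grad():  # inference only, no gradients needed
    token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
# Print the sentence completed with each of the five highest-scoring tokens
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))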
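The snippet above ranks candidates by raw logits. If probabilities are wanted alongside the completed sentences, a softmax over the vocabulary scores provides them. The following is a minimal, standard-library-only sketch of that ranking step. The helper names `softmax` and `top_k_fills`, the toy vocabulary, and the toy scores are all hypothetical illustrations: they are not part of gaBERT or the transformers API, and the numbers are not real model outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_fills(template, mask_token, vocab, logits, k=5):
    """Return (filled_sentence, probability) pairs for the k best candidates."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [(template.replace(mask_token, tok), p) for tok, p in ranked[:k]]


# Toy values for illustration only; these are NOT gaBERT outputs.
toy_vocab = ["alpha", "beta", "gamma", "delta"]
toy_logits = [2.0, 0.5, 3.0, -1.0]

for sentence, prob in top_k_fills(
    "Ceoltóir [MASK] ab ea Johnny Cash.", "[MASK]", toy_vocab, toy_logits, k=2
):
    print(f"{prob:.3f}  {sentence}")
```

With the real model, the same effect would presumably be obtained by applying `torch.softmax(mask_token_logits, dim=-1)` before calling `torch.topk`, which returns the probabilities in its `values` field.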
BibTeX entry and citation info
If you use this model in your research, please consider citing our paper:
@article{DBLP:journals/corr/abs-2107-12930,
  author        = {James Barry and
                   Joachim Wagner and
                   Lauren Cassidy and
                   Alan Cowap and
                   Teresa Lynn and
                   Abigail Walsh and
                   M{\'{\i}}che{\'{a}}l J. {\'{O}} Meachair and
                   Jennifer Foster},
  title         = {gaBERT - an Irish Language Model},
  journal       = {CoRR},
  volume        = {abs/2107.12930},
  year          = {2021},
  url           = {https://arxiv.org/abs/2107.12930},
  archivePrefix = {arXiv},
  eprint        = {2107.12930},
  timestamp     = {Fri, 30 Jul 2021 13:03:06 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2107-12930.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}