RoBERTa trained on Sanskrit (SanBERTa)

Model size (after training): 340 MB

Dataset:

  • Wikipedia articles (as used in iNLTK); these include an evaluation set.
  • Sanskrit text scraped via CLTK.

Configuration

| Parameter           | Value |
|---------------------|-------|
| num_attention_heads | 12    |
| num_hidden_layers   | 6     |
| hidden_size         | 768   |
| vocab_size          | 29407 |
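These values map onto the standard RoBERTa `config.json` fields; a minimal sketch of the corresponding fragment (fields not listed in the table, e.g. `max_position_embeddings`, are left to the `transformers` defaults and are not confirmed here):

```json
{
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "hidden_size": 768,
  "vocab_size": 29407
}
```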

Training:

  • On TPU
  • For language modelling
  • Iteratively increasing --block_size from 128 to 256 over epochs
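A schedule like this could be run with the `transformers` example script `run_language_modeling.py`, restarting from the last checkpoint with a larger `--block_size` each time. This is a hypothetical sketch (the actual command, block-size steps, and data paths used for this model are not documented here), not the author's recorded recipe; on TPU the script would be launched via `xla_spawn.py`:

```shell
# Hypothetical sketch: grow --block_size from 128 to 256 across training runs,
# resuming from the previous output directory each time.
for BLOCK in 128 192 256; do
  python run_language_modeling.py \
    --model_type roberta \
    --model_name_or_path ./SanBERTa \
    --train_data_file ./train.txt \
    --output_dir ./SanBERTa \
    --mlm \
    --block_size "$BLOCK" \
    --num_train_epochs 1 \
    --do_train
done
```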

Evaluation

| Metric                      | Value |
|-----------------------------|-------|
| Perplexity (block_size=256) | 4.04  |
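Perplexity is the exponential of the average cross-entropy loss on the evaluation set, so the reported 4.04 corresponds to roughly 1.40 nats per token. A small sketch of the conversion (the loss value is back-derived from the reported perplexity, not taken from a log):

```python
import math

# Perplexity = exp(mean cross-entropy loss); invert the reported number.
perplexity = 4.04
loss = math.log(perplexity)
print(round(loss, 2))  # 1.4
```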

Example usage:

For Embeddings


```python
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
model = RobertaModel.from_pretrained("surajp/SanBERTa")

# "This language is considered the oldest language not only of India but of the world."
op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
ps = model(op)
ps[0].shape
'''
Output:
--------
torch.Size([1, 47, 768])
'''
```
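The last hidden state above is one 768-dim vector per token; a common way to get a single fixed-size sentence embedding is to mean-pool over the token axis. A minimal sketch with NumPy standing in for the model output (`hidden` is random data of the same shape as `ps[0]`, not a real activation):

```python
import numpy as np

# Stand-in for ps[0].detach().numpy(): batch of 1, 47 tokens, 768 dims.
hidden = np.random.randn(1, 47, 768)

# Mean-pool across the 47 token positions -> one 768-dim sentence vector.
sentence_vec = hidden.mean(axis=1)
print(sentence_vec.shape)  # (1, 768)
```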

For <mask> Prediction

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="surajp/SanBERTa",
    tokenizer="surajp/SanBERTa"
)

# Masked form of: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
'''
Output:
--------
[{'score': 0.7516744136810303,
  'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 280,
  'token_str': 'ं'},
 {'score': 0.06230105459690094,
  'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 289,
  'token_str': 'ी'},
 {'score': 0.055410224944353104,
  'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 265,
  'token_str': 'ा'},
  ...]
'''
```
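The pipeline returns a list of candidate dicts sorted by score; a small sketch of picking the best candidate programmatically (the list below is hand-copied from the output above, not recomputed):

```python
# Top candidates as returned by fill_mask (abbreviated from the output above).
predictions = [
    {"score": 0.7516744136810303, "token": 280, "token_str": "ं"},
    {"score": 0.06230105459690094, "token": 289, "token_str": "ी"},
    {"score": 0.055410224944353104, "token": 265, "token_str": "ा"},
]

# The highest-scoring candidate is the model's fill for <mask>.
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # prints the anusvara that completes केवलं
```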
Citation:

```bibtex
@misc{Parmar2020Sanberta,
  author = {Parmar, Suraj},
  title = {SanBERTa - a RoBERTa trained on Sanskrit},
  year = {2020},
  month = {Jun},
  publisher = {Hugging Face Model Hub},
  url = {https://huggingface.co./surajp/SanBERTa}
}
```

It works!! 🎉 🎉 🎉

Created by Suraj Parmar/@parmarsuraj99 | LinkedIn

Made with ❤️ in India
