PaloBERT

A Greek pre-trained language model based on RoBERTa. This model is an updated version of palobert-base-greek-uncased-v1.

Pre-training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included.

The corpus has been provided by Palo LTD.

Requirements

pip install transformers
pip install torch

Pre-processing details

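To use this model, the input text must first be pre-processed as follows:

  • remove all Greek diacritics
  • convert to lowercase
  • remove all punctuation

import re
import unicodedata

def preprocess(text, default_replace=""):
  # Lowercase first so the remaining steps see a single letter case.
  text = text.lower()
  # Decompose accented letters (NFD) and drop the combining acute accent (tonos).
  text = unicodedata.normalize('NFD',text).translate({ord('\N{COMBINING ACUTE ACCENT}'):None})
  # Replace every character that is neither a word character nor whitespace.
  text = re.sub(r'[^\w\s]', default_replace, text)
  return text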
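
To see what each step contributes, the same three operations can be traced on a sample phrase. This is a minimal sketch: the sample string and the intermediate variable names are illustrative assumptions, not part of the original model card.

```python
import re
import unicodedata

# Hypothetical sample input: mixed case, accents and punctuation.
raw = "Μέσα Κοινωνικής Δικτύωσης!"

# Step 1: convert to lowercase.
lowered = raw.lower()

# Step 2: split accented letters into base letter + combining mark (NFD),
# then delete the combining acute accent.
decomposed = unicodedata.normalize("NFD", lowered)
no_accents = decomposed.translate({ord("\N{COMBINING ACUTE ACCENT}"): None})

# Step 3: drop everything that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", no_accents)

print(cleaned)  # μεσα κοινωνικης δικτυωσης
```

The result has the same form as the input used in the fill-mask example in this card.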

Load Model

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media-v2")

model = AutoModelForMaskedLM.from_pretrained("pchatz/palobert-base-greek-social-media-v2")

You can use this model directly with a pipeline for masked language modeling. The input text should already be pre-processed as described above:

from transformers import pipeline

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')

[{'score': 0.8760559558868408,
  'token': 12853,
  'token_str': ' κοινωνικης',
  'sequence': 'μεσα κοινωνικης δικτυωσης'},
 {'score': 0.020922638475894928,
  'token': 1104,
  'token_str': ' μεσα',
  'sequence': 'μεσα μεσα δικτυωσης'},
 {'score': 0.017568595707416534,
  'token': 337,
  'token_str': ' της',
  'sequence': 'μεσα της δικτυωσης'},
 {'score': 0.006678201723843813,
  'token': 1258,
  'token_str': 'τικης',
  'sequence': 'μεσατικης δικτυωσης'},
 {'score': 0.004737381357699633,
  'token': 16245,
  'token_str': 'τερης',
  'sequence': 'μεσατερης δικτυωσης'}]
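
One practical detail when combining pre-processing with masking: the punctuation-removal step would also strip punctuation inside the mask token itself, if the token contains any (RoBERTa's default `<mask>` does). A minimal sketch of one way around this is to clean the text on each side of a marker and insert the mask token afterwards. The helper name `mask_after_cleaning` and the `[MASK_HERE]` marker are hypothetical and not part of the model or of transformers.

```python
def mask_after_cleaning(raw_text, mask_token, clean, placeholder="[MASK_HERE]"):
    """Clean the text on both sides of `placeholder`, then insert `mask_token`.

    `clean` is the cleaning function to apply (for example the `preprocess`
    function from the pre-processing section). `placeholder` is a
    hypothetical marker showing where the mask belongs in the raw text.
    """
    left, found, right = raw_text.partition(placeholder)
    if not found:
        raise ValueError(f"placeholder {placeholder!r} not found in text")
    # Clean each side separately so the mask token itself is never altered.
    pieces = [clean(left).strip(), mask_token, clean(right).strip()]
    # Skip empty sides so a mask at the start or end gets no stray space.
    return " ".join(piece for piece in pieces if piece)
```

With the objects defined earlier, a call such as `fill(mask_after_cleaning('Μέσα [MASK_HERE] Δικτύωσης!', fill.tokenizer.mask_token, preprocess))` should send the pipeline an input of the same form as the example above.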

Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis 'Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' ("Sentiment Analysis of Greek Text Using Transformer Networks"), version p2.

Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

BibTeX entry and Citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623


@Article{info12080331,
AUTHOR = {Alexandridis, Georgios and Varlamis, Iraklis and Korovesis, Konstantinos and Caridakis, George and Tsantilas, Panagiotis},
TITLE = {A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media},
JOURNAL = {Information},
VOLUME = {12},
YEAR = {2021},
NUMBER = {8},
ARTICLE-NUMBER = {331},
URL = {https://www.mdpi.com/2078-2489/12/8/331},
ISSN = {2078-2489},
DOI = {10.3390/info12080331}
}