---
language:
- el
pipeline_tag: text-classification
---

# PaloBERT for Sentiment Analysis

A Greek [RoBERTa](https://arxiv.org/abs/1907.11692)-based model ([PaloBERT](https://huggingface.co./pchatz/palobert-base-greek-social-media), an updated version of [palobert-base-greek-uncased-v1](https://huggingface.co./gealexandri/palobert-base-greek-uncased-v1)) fine-tuned for sentiment analysis.

## Training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included. Fine-tuning uses a dataset of ~60,000 documents, also collected from Greek social media.

Both the corpus and the annotated dataset were provided by [Palo LTD](http://www.paloservices.com/).

## Requirements

```
pip install transformers
pip install torch
```

## Pre-processing details

Before being passed to the model, the text needs to be pre-processed as follows:

* remove all Greek diacritics
* convert to lowercase
* remove all punctuation

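```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # Convert to lowercase.
    text = text.lower()
    # Decompose accented letters (NFD) and drop the combining acute accent (tonos).
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # Replace every character that is neither a word character nor whitespace,
    # which removes punctuation and any remaining combining marks.
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```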
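
As a quick check of the steps listed above, the sketch below runs the same `preprocess` function on two made-up inputs. The sample strings are illustrative assumptions and are not taken from the training data. Python's `\w` does not match combining marks, so the final substitution also strips marks other than the acute accent, such as the dialytika.

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # Same steps as described above: lowercase, strip the acute accent, drop punctuation.
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text

# Illustrative inputs (assumptions, not drawn from the dataset).
samples = ["Οι εξετάσεις ήταν πολύ καλές!", "Προϊόν"]
cleaned = [preprocess(s) for s in samples]
print(cleaned)
```

The first cleaned string is the same unaccented, lowercase sentence used in the inference example of this card.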

## Load Model

```python
from transformers import AutoTokenizer, AutoModel

# Load the PaloBERT tokenizer and pre-trained language model
tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media-v2")
language_model = AutoModel.from_pretrained("pchatz/palobert-base-greek-social-media-v2")
```

Refer to the [GitHub code](https://github.com/Paulinechatz/sentiment-analysis-greek-social-media/blob/main/code/train_classifier_roberta_arch.py#L100) for details on the model class architecture.

```python
import torch

# Load the fine-tuned model as SentimentClassifier_v2
# (TheModelClass, its arguments and PATH are placeholders)
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
```

Once loaded, the sentiment analysis model can be applied to pre-processed text:

```python
# Example
class_names = {0: 'neutral', 1: 'positive', 2: 'negative'}
text = 'οι εξετασεις ηταν πολυ καλες'  # already lowercase, without accents or punctuation
encoding = tokenizer(text, return_tensors='pt')

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)

print(f'sentiment : {class_names[prediction.item()]}')  # positive
```
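
To report a confidence score alongside the label, the scores can be normalised with a softmax. The sketch below is a minimal illustration, assuming the classifier returns unnormalised logits of shape `(batch, 3)`; this is not confirmed by the card. The helper `describe_predictions` and the example tensor are hypothetical and stand in for a real model output.

```python
import torch

# Label mapping copied from the example above.
class_names = {0: 'neutral', 1: 'positive', 2: 'negative'}

def describe_predictions(output):
    """Return (label, confidence) pairs for a (batch, 3) tensor of scores.

    Assumes `output` holds unnormalised logits; if the classifier already
    applies a softmax, these confidences would be distorted.
    """
    probs = torch.softmax(output, dim=1)
    confidences, predictions = torch.max(probs, dim=1)
    return [(class_names[p.item()], c.item()) for p, c in zip(predictions, confidences)]

# Hypothetical logits standing in for `model(input_ids, attention_mask)`.
example_output = torch.tensor([[0.1, 2.3, -1.0], [1.5, 0.2, 0.3]])
results = describe_predictions(example_output)
for label, confidence in results:
    print(f'{label}: {confidence:.2f}')
```

The predicted label is the same as with `torch.max(output, dim=1)` alone, since softmax preserves the ordering of the scores.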

## Evaluation

For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Sentiment analysis of Greek text using Transformer networks") (version - p2).

## Authors

[Pavlina Chatziantoniou](https://huggingface.co./pchatz), [Georgios Alexandridis](https://huggingface.co./gealexandri) and Athanasios Voulodimos

## BibTeX entry and citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623

```bibtex
@Article{info12080331,
AUTHOR = {Alexandridis, Georgios and Varlamis, Iraklis and Korovesis, Konstantinos and Caridakis, George and Tsantilas, Panagiotis},
TITLE = {A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media},
JOURNAL = {Information},
VOLUME = {12},
YEAR = {2021},
NUMBER = {8},
ARTICLE-NUMBER = {331},
URL = {https://www.mdpi.com/2078-2489/12/8/331},
ISSN = {2078-2489},
DOI = {10.3390/info12080331}
}
```