---
language:
- el
pipeline_tag: text-classification
---
# PaloBERT for Sentiment Analysis

A Greek [RoBERTa](https://arxiv.org/abs/1907.11692)-based model ([PaloBERT](https://huggingface.co./pchatz/palobert-base-greek-social-media), an updated version of [palobert-base-greek-uncased-v1](https://huggingface.co./gealexandri/palobert-base-greek-uncased-v1)) fine-tuned for sentiment analysis.

## Training data

The model is pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook, and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included. Fine-tuning was performed on a dataset of ~60,000 documents, also collected from Greek social media.

Both the corpus and the annotated dataset were provided by [Palo LTD](http://www.paloservices.com/).

## Requirements

```
pip install transformers torch
```

## Pre-processing details

To use this model, the input text needs to be pre-processed as follows:

* remove all Greek diacritics
* convert to lowercase
* remove all punctuation

```python
import re
import unicodedata

def preprocess(text, default_replace=""):
    # convert to lowercase
    text = text.lower()
    # decompose accented characters (NFD) and drop the combining acute accent
    text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
    # remove punctuation (anything that is neither a word character nor whitespace)
    text = re.sub(r'[^\w\s]', default_replace, text)
    return text
```
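
For example (the sentence below is only illustrative), `preprocess` turns an accented, punctuated sentence into the normalized form the tokenizer expects:

```python
raw = 'Οι εξετάσεις ήταν πολύ καλές!'  # "The exams went very well!"
print(preprocess(raw))  # οι εξετασεις ηταν πολυ καλες
```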

## Load Model

```python
from transformers import AutoTokenizer, AutoModel

# load the PaloBERT tokenizer and pre-trained language model
tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media-v2")
language_model = AutoModel.from_pretrained("pchatz/palobert-base-greek-social-media-v2")
```
Refer to the code on [GitHub](https://github.com/Paulinechatz/sentiment-analysis-greek-social-media/blob/main/code/train_classifier_roberta_arch.py#L100) for details on the classifier architecture. `AutoModel` only loads the base encoder, so the fine-tuned classifier (`SentimentClassifier_v2`) is instantiated separately and its weights are loaded from a checkpoint:
```python
import torch

# instantiate the classifier defined in the linked code (SentimentClassifier_v2)
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))  # PATH: path to the fine-tuned checkpoint
model.eval()
```
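
The exact architecture is defined in the repository linked above. Purely as orientation, a classifier of this kind typically places a dropout layer and a linear head on top of the pre-trained encoder; the sketch below is an illustrative assumption, not the repository's `SentimentClassifier_v2`:

```python
import torch.nn as nn
from transformers import AutoModel

class SentimentClassifierSketch(nn.Module):
    """Illustrative only: encoder + dropout + linear head over 3 classes."""
    def __init__(self, model_name="pchatz/palobert-base-greek-social-media-v2", n_classes=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # use the hidden state of the first token as the sentence representation
        pooled = outputs.last_hidden_state[:, 0]
        return self.classifier(self.dropout(pooled))  # raw logits, shape (batch, n_classes)
```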
You can then classify text with the fine-tuned model (the example sentence below is already in pre-processed form):
```python
# example: classify a pre-processed sentence
class_names = {0: 'neutral', 1: 'positive', 2: 'negative'}

text = 'οι εξετασεις ηταν πολυ καλες'  # "the exams went very well"
encoding = tokenizer(text, return_tensors='pt')

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)

print(f'sentiment: {class_names[prediction.item()]}')  # positive
```
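
Putting the pieces together, a small helper that chains pre-processing, tokenization, and prediction could look like the following sketch (it assumes the model returns raw logits, as in the snippet above, and reuses `preprocess`, `tokenizer`, `model`, and `class_names` from earlier):

```python
import torch

def predict_sentiment(raw_text):
    clean = preprocess(raw_text)  # normalize the input as during training
    encoding = tokenizer(clean, return_tensors='pt')
    with torch.no_grad():
        logits = model(encoding['input_ids'], encoding['attention_mask'])
    return class_names[logits.argmax(dim=1).item()]

print(predict_sentiment('Οι εξετάσεις ήταν πολύ καλές!'))  # positive
```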

## Evaluation

For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Sentiment analysis of Greek text using Transformer Networks"), version p2.

## Authors

[Pavlina Chatziantoniou](https://huggingface.co./pchatz), [Georgios Alexandridis](https://huggingface.co./gealexandri) and Athanasios Voulodimos

## BibTeX entry and citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623

```bibtex
@Article{info12080331,
AUTHOR = {Alexandridis, Georgios and Varlamis, Iraklis and Korovesis, Konstantinos and Caridakis, George and Tsantilas, Panagiotis},
TITLE = {A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media},
JOURNAL = {Information},
VOLUME = {12},
YEAR = {2021},
NUMBER = {8},
ARTICLE-NUMBER = {331},
URL = {https://www.mdpi.com/2078-2489/12/8/331},
ISSN = {2078-2489},
DOI = {10.3390/info12080331}
}
```