pchatz committed on
Commit
c0c9452
·
1 Parent(s): e584c07

Create README.md

Files changed (1)
  1. README.md +66 -0
README.md ADDED
@@ -0,0 +1,66 @@
+ ---
+ language:
+ - el
+ pipeline_tag: text-classification
+ ---
+ # PaloBERT for Sentiment Analysis
+
+ A Greek [RoBERTa](https://arxiv.org/abs/1907.11692)-based model ([PaloBERT](https://huggingface.co/pchatz/greeksocialbert-base-greek-social-media)) fine-tuned for sentiment analysis.
+
+ ## Training data
+
+ The base model was pre-trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook and YouTube). A RoBERTa tokenizer trained from scratch on the same corpus is also included. Fine-tuning used a dataset of ~60,000 documents, also collected from Greek social media.
+
+ Both the corpus and the annotated dataset were provided by [Palo LTD](http://www.paloservices.com/).
+
+
+ ## Requirements
+
+ ```
+ pip install transformers
+ pip install torch
+ ```
+
+ ## Pre-processing details
+
+ Before passing text to 'palobert-base-greek-social-media-sentiment', pre-process it as follows:
+
+ * remove all Greek diacritics
+ * convert to lowercase
+ * remove all punctuation
+
+ ```python
+ import re
+ import unicodedata
+
+ def preprocess(text, default_replace=""):
+     # Convert to lowercase
+     text = text.lower()
+     # Decompose accented letters (NFD) and drop the combining acute accent (tonos)
+     text = unicodedata.normalize('NFD', text).translate({ord('\N{COMBINING ACUTE ACCENT}'): None})
+     # Replace every character that is neither a word character nor whitespace
+     text = re.sub(r'[^\w\s]', default_replace, text)
+     return text
+ ```
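
As a quick sanity check of the three steps listed above, the sketch below repeats the same function so that it runs on its own and applies it to a made-up sample sentence. The sample text and the `cleaned` variable are assumptions for illustration only; they do not come from the training data.

```python
import re
import unicodedata


def preprocess(text, default_replace=""):
    # Same steps as the README function, repeated so this snippet is self-contained
    text = text.lower()
    text = unicodedata.normalize("NFD", text).translate(
        {ord("\N{COMBINING ACUTE ACCENT}"): None}
    )
    return re.sub(r"[^\w\s]", default_replace, text)


# Hypothetical sample sentence ("Good morning, world!"), chosen only for illustration
sample = "Καλημέρα, κόσμε!"
cleaned = preprocess(sample)
print(cleaned)  # καλημερα κοσμε
```

Passing `default_replace=" "` substitutes a space for each punctuation mark instead of deleting it, which keeps words apart when the input has no whitespace around punctuation (for example `"καλό,κακό"` becomes `"καλο κακο"`).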
+
+ ## Load Model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("pchatz/palobert-base-greek-social-media-sentiment")
+
+ model = AutoModelForSequenceClassification.from_pretrained("pchatz/palobert-base-greek-social-media-sentiment")
+ ```
+ You can also use this model directly with a `text-classification` pipeline.
+
+
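
A minimal sketch of pipeline-based inference, assuming the checkpoint loads under the `text-classification` task declared in the model card metadata. The helper name `classify` and the constant `MODEL_ID` are hypothetical, and the label strings returned depend on the model's configuration, which this README does not list. Input texts are assumed to have been pre-processed already (lowercase, no diacritics, no punctuation).

```python
MODEL_ID = "pchatz/palobert-base-greek-social-media-sentiment"


def classify(preprocessed_texts, model_id=MODEL_ID):
    """Return one {'label': ..., 'score': ...} dict per input text."""
    # Imported here so that defining the helper does not require transformers
    from transformers import pipeline

    # Assumption: the same repository provides both the model and the tokenizer
    classifier = pipeline("text-classification", model=model_id, tokenizer=model_id)
    return classifier(list(preprocessed_texts))
```

Calling `classify(["καλημερα κοσμε"])` downloads the model on first use and returns a list with one prediction per text. Building the pipeline once and reusing it is preferable when classifying many batches, since each call to this helper reloads the model.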
+ ## Evaluation
+
+ For detailed results, refer to the thesis ['Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών'](http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623) ("Sentiment analysis of Greek text using Transformer networks") (version - p2)
+
+ ## Authors
+
+ Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos
+
+ ## Citation info
+
+ http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623