codewithkyrian committed on
Commit
41bbf89
verified
1 Parent(s): 28a5fff

Upload README.md with huggingface_hub

  1. README.md +104 -0
README.md ADDED
@@ -0,0 +1,104 @@
1
+ ---
2
+ library_name: Transformers PHP
3
+ tags:
4
+ - onnx
5
+ ---
6
+
7
+ https://huggingface.co/avichr/heBERT_sentiment_analysis with ONNX weights to be compatible with Transformers PHP
8
+
9
+ ## HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
10
+ HeBERT is a Hebrew pre-trained language model. It is based on Google's BERT architecture and uses the BERT-Base configuration [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805). <br>
11
+
12
+ HeBERT was trained on three datasets:
13
+ 1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
14
+ 2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.
15
+ 3. Emotion UGC data collected for the purpose of this study (described below).
16
+ We evaluated the model on two downstream tasks: emotion recognition and sentiment analysis.
17
+
18
+ ### Emotion UGC Data Description
19
+ Our User-Generated Content (UGC) consists of comments written on articles from 3 major news sites, collected between January 2020 and August 2020. The total data size is ~150 MB, including over 7 million words and 350K sentences.
20
+ 4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happy, sadness, surprise, and trust) and overall sentiment/polarity. <br>
21
+ To validate the annotation, we measured the agreement between raters on the emotions in each sentence using Krippendorff's alpha [(Krippendorff, 1970)](https://journals.sagepub.com/doi/pdf/10.1177/001316447003000105). We kept only sentences with alpha > 0.7. Note that while raters generally agreed on emotions such as happiness, trust, and disgust, a few emotions (e.g. expectation and surprise) showed general disagreement, apparently because they are hard to identify in text.
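
The README does not spell out how alpha was computed, so the following is only a minimal sketch of the standard nominal-data form of Krippendorff's alpha over a set of annotated items. The function name `krippendorff_alpha_nominal` and the sample ratings are hypothetical and are not taken from the HeBERT code or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that the
    annotators assigned to one item. Items with fewer than two labels are
    skipped because they carry no agreement information.
    """
    coincidences = Counter()  # (label_c, label_k) -> weighted pair count
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from two different annotators
        # contributes 1 / (m - 1) to the coincidence matrix.
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()  # marginal total per label
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        # Only one label was ever used; treating this as full agreement
        # is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


# Made-up ratings for illustration only.
ratings = [
    ["anger", "anger", "anger"],
    ["trust", "trust", "fear"],
    ["fear", "fear"],
]
alpha = krippendorff_alpha_nominal(ratings)
passes_threshold = alpha > 0.7  # the cut-off quoted in the text above
```

Established packages offer more complete implementations (other distance metrics, missing data handling); the sketch only shows the shape of the calculation.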
22
+
23
+ ### Performance
24
+ #### Sentiment analysis
25
+
26
+ | | precision | recall | f1-score |
27
+ |--------------|-----------|--------|----------|
28
+ | natural | 0.83 | 0.56 | 0.67 |
29
+ | positive | 0.96 | 0.92 | 0.94 |
30
+ | negative | 0.97 | 0.99 | 0.98 |
31
+ | accuracy | | | 0.97 |
32
+ | macro avg | 0.92 | 0.82 | 0.86 |
33
+ | weighted avg | 0.96 | 0.97 | 0.96 |
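
As a sanity check on how the rows of this table relate to each other, the short sketch below recomputes the per-class F1 scores and the macro averages from the per-class precision and recall values. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over classes); the weighted average is not recomputed because the table does not list per-class support.

```python
# Per-class precision and recall copied from the table above.
per_class = {
    "natural": {"precision": 0.83, "recall": 0.56},
    "positive": {"precision": 0.96, "recall": 0.92},
    "negative": {"precision": 0.97, "recall": 0.99},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over classes."""
    values = list(values)
    return sum(values) / len(values)


f1_scores = {
    label: round(f1(s["precision"], s["recall"]), 2)
    for label, s in per_class.items()
}
macro_precision = round(macro_average(s["precision"] for s in per_class.values()), 2)
macro_recall = round(macro_average(s["recall"] for s in per_class.values()), 2)
macro_f1 = round(macro_average(f1_scores.values()), 2)

print(f1_scores)        # {'natural': 0.67, 'positive': 0.94, 'negative': 0.98}
print(macro_precision, macro_recall, macro_f1)  # 0.92 0.82 0.86
```

The recomputed values match the `f1-score` column and the `macro avg` row.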
34
+
35
+ ## How to use
36
+ ### For masked-LM model (can be fine-tuned to any downstream task)
37
+ ```
38
+ from transformers import AutoTokenizer, AutoModel
39
+ tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
40
+ model = AutoModel.from_pretrained("avichr/heBERT")
41
+
42
+ from transformers import pipeline
43
+ fill_mask = pipeline(
44
+ "fill-mask",
45
+ model="avichr/heBERT",
46
+ tokenizer="avichr/heBERT"
47
+ )
48
+ fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
49
+ ```
50
+
51
+ ### For sentiment classification model (polarity ONLY):
52
+ ```
53
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
54
+ tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")  # same as the 'avichr/heBERT' tokenizer
55
+ model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")  # includes the classification head
56
+
57
+ # Build a pipeline that returns the scores for all labels
58
+ sentiment_analysis = pipeline(
59
+ "sentiment-analysis",
60
+ model="avichr/heBERT_sentiment_analysis",
61
+ tokenizer="avichr/heBERT_sentiment_analysis",
62
+ return_all_scores=True
63
+ )
64
+
65
+ >>> sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')
66
+ [[{'label': 'natural', 'score': 0.9978172183036804},
67
+ {'label': 'positive', 'score': 0.0014792329166084528},
68
+ {'label': 'negative', 'score': 0.0007035882445052266}]]
69
+
70
+ >>> sentiment_analysis('קפה זה טעים')
71
+ [[{'label': 'natural', 'score': 0.00047328314394690096},
72
+ {'label': 'positive', 'score': 0.9994067549705505},
73
+ {'label': 'negative', 'score': 0.00011996887042187154}]]
74
+
75
+ >>> sentiment_analysis('אני לא אוהב את העולם')
76
+ [[{'label': 'natural', 'score': 9.214012970915064e-05},
77
+ {'label': 'positive', 'score': 8.876807987689972e-05},
78
+ {'label': 'negative', 'score': 0.9998190999031067}]]
79
+ ```
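
When only the predicted polarity is needed, the nested list returned above can be reduced to its highest-scoring entry. The sketch below assumes the output has exactly the shape shown in the examples (one inner list of `label`/`score` dictionaries per input text); `top_label` is a hypothetical helper, not part of the `transformers` API.

```python
def top_label(pipeline_output):
    """Return (label, score) of the highest-scoring class for one input text."""
    scores = pipeline_output[0]  # the outer list has one entry per input text
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Output copied from the first example above.
example_output = [[
    {"label": "natural", "score": 0.9978172183036804},
    {"label": "positive", "score": 0.0014792329166084528},
    {"label": "negative", "score": 0.0007035882445052266},
]]

label, score = top_label(example_output)
print(label, round(score, 4))  # natural 0.9978
```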
80
+
81
+ Our model is also available on AWS! For more information, visit the [AWS GitHub repository](https://github.com/aws-samples/aws-lambda-docker-serverless-inference/tree/main/hebert-sentiment-analysis-inference-docker-lambda).
82
+
83
+
84
+ ## Stay tuned!
85
+ We are still working on our model and will edit this page as we progress.<br>
86
+ Note that we have released only sentiment analysis (polarity) at this point; emotion detection will be released later on.<br>
87
+ Our GitHub repository: https://github.com/avichaychriqui/HeBERT
88
+
89
+
90
+ ## If you used this model, please cite us as:
91
+ Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.
92
+ ```
93
+ @article{chriqui2021hebert,
94
+ title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
95
+ author={Chriqui, Avihay and Yahav, Inbal},
96
+ journal={arXiv preprint arXiv:2102.01909},
97
+ year={2021}
98
+ }
99
+ ```
100
+
101
+
102
+ ---
103
+
104
+ Note: Having a separate repo for ONNX weights is intended to be a temporary solution until WebML gains more traction. If you would like to make your models web-ready, we recommend converting to ONNX using [🤗 Optimum](https://huggingface.co/docs/optimum/index) and structuring your repo like this one (with ONNX weights located in a subfolder named `onnx`).