---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---

# Knesset-DictaBERT

**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), which comprises Israeli parliamentary proceedings.

This model is based on the [DictaBERT](https://huggingface.co/dicta-il/dictabert) architecture and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.

## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned on the Knesset Corpus using the masked language modeling (MLM) objective, in which randomly masked tokens in a sentence must be predicted from their context, allowing the model to learn contextual representations of words. A minimal sketch of such a setup is shown below.
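
The following sketch illustrates a standard Hugging Face MLM fine-tuning recipe under stated assumptions: the hyperparameters and the `sentence_text` column name are illustrative, not the exact configuration used to train this model.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert")
model = AutoModelForMaskedLM.from_pretrained("dicta-il/dictabert")

# The Knesset Corpus is large; streaming or sharding may be needed in practice.
dataset = load_dataset("HaifaCLGroup/KnessetCorpus", split="train")

def tokenize(batch):
    # "sentence_text" is an assumed column name for the sentence field.
    return tokenizer(batch["sentence_text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Randomly masks 15% of the tokens in each batch (the standard BERT MLM recipe).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="knesset-dictabert", per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```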

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")

model.eval()

sentence = "הכנסת היא הרשות [MASK] של מדינת ישראל."

# Tokenize the input sentence and run the model
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# Locate the [MASK] token dynamically rather than hard-coding its position,
# since Hebrew words may be split into several subword tokens.
mask_token_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# Take the two highest-scoring predictions for the masked position
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices.tolist()

# Convert token IDs to tokens and print them
print("\n".join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: המבצעת / המחוקקת
```

## Evaluation

The evaluation was conducted on a held-out 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences. Perplexity was computed on this full test set. Due to time constraints, the accuracy measures were computed on a subset of approximately 3 million sentences (around 520 million tokens).

#### Perplexity

On the full test set, the original DictaBERT has a perplexity of 22.87, while Knesset-DictaBERT reaches 6.60.
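
As a rough illustration, MLM perplexity can be estimated by exponentiating the mean masked-token loss over the test set. The sketch below assumes random 15% masking and averages the loss per batch; it is not the exact evaluation script behind the numbers above.

```python
import math

import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT").eval()

# Same masking scheme as training: 15% of tokens are masked at random.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def mlm_perplexity(sentences, batch_size=32):
    features = [tokenizer(s, truncation=True, max_length=512) for s in sentences]
    loader = DataLoader(features, batch_size=batch_size, collate_fn=collator)
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in loader:
            # The collator supplies masked input_ids and labels; the model
            # returns the cross-entropy loss over the masked positions.
            total_loss += model(**batch).loss.item()
            n_batches += 1
    return math.exp(total_loss / n_batches)
```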

#### Accuracy

Top-k accuracy measures how often the correct token appears among the model's k highest-scoring predictions for a masked position:

| Model | Top-1 | Top-2 | Top-5 |
|---|---|---|---|
| DictaBERT (original) | 48.02% | 58.60% | 68.98% |
| Knesset-DictaBERT | 52.55% | 63.07% | 73.59% |
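
The sketch below shows one way such a top-k measure can be computed, masking one position at a time; the helper name and looping strategy are illustrative assumptions, not the original evaluation code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT").eval()

def top_k_accuracy(sentence, k=5):
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    hits, total = 0, 0
    # Mask each position in turn, skipping [CLS] and [SEP] at the ends.
    for i in range(1, input_ids.size(1) - 1):
        masked = input_ids.clone()
        original_id = masked[0, i].item()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
        top_k = torch.topk(logits[0, i], k).indices.tolist()
        hits += int(original_id in top_k)
        total += 1
    return hits / total
```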

## Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{Knesset-DictaBERT,
  author       = {Gili Goldin},
  title        = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}
```