NadavShaked committed on
Commit cdc8965
1 Parent(s): 91da6cc

Update README.md

Files changed (1):
  1. README.md +158 -1
README.md CHANGED
@@ -5,4 +5,161 @@ tags:
   - roberta
   - Language model
   pipeline_tag: text-generation
- ---
+ ---
+
+ # D-Nikud
+
+ Welcome to the main code repository of the D-Nikud diacritization model! This repository implements our D-Nikud model, which uses the TavBERT architecture together with a Bi-LSTM to predict and apply diacritics (nikud) to Hebrew text. Diacritics play a crucial role in accurately conveying pronunciation and interpretation, making the model a useful tool for enhancing the quality of Hebrew text analysis.
+
+ The code provided here covers prediction, evaluation, and training of the D-Nikud diacritization model.
+
+ Repository for the paper [D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models](https://arxiv.org/abs/2402.00075) by Nadav Shaked and Adi Rosenthal.
+ ## Prerequisites
+
+ Before running the script, make sure you have the following installed:
+
+ - Python 3.10 (the version the code was tested with)
+ - `torch` library (PyTorch)
+ - `transformers` library
+ - Required Python packages (install with `pip install -r requirements.txt`; see the setup sketch below)
+
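+ A minimal environment setup, assuming a Unix-like shell (the environment name `.venv` is illustrative):
+
+ ```bash
+ # Create and activate an isolated Python 3.10 environment (optional but recommended)
+ python3.10 -m venv .venv
+ source .venv/bin/activate
+
+ # Install the required packages, including torch and transformers
+ pip install -r requirements.txt
+ ```
+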
+ ## Table of Contents
+ - [Introduction](#introduction)
+ - [Pre-Trained Model](#pre-trained-model)
+ - [Usage](#usage)
+   - [Predict](#predict)
+   - [Evaluate](#evaluate)
+   - [Train](#train)
+ - [Acknowledgments](#acknowledgments)
+ - [License](#license)
+
+ ## Introduction
+
+ Our D-Nikud model utilizes the TavBERT architecture and a Bi-LSTM for diacritization (nikud) of Hebrew text. Diacritics (nikud) are essential for accurate pronunciation and interpretation of the text. This repository provides the core code for implementing and utilizing the D-Nikud model.
+
+ ## Pre-Trained Model
+
+ Our pre-trained D-Nikud model can be downloaded from [this link](https://drive.google.com/drive/folders/1osK503txvsEWlZASBViSqOiNJMzAlD0F). To use it, unzip the downloaded file and copy its contents into the `models` folder.
+
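+ A sketch of placing the weights, assuming the download arrives as a zip archive (the archive name below is illustrative):
+
+ ```bash
+ # Extract the downloaded model weights into the models folder
+ unzip d_nikud_model.zip -d models/
+ ```
+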
+ ## Usage
+
+ Clone the repository:
+
+ ```bash
+ git clone https://github.com/NadavShaked/D_Nikud.git
+ cd D_Nikud
+ ```
+
+ Clone the D-Nikud data (tracked as git submodules):
+
+ ```bash
+ git submodule update --init --recursive
+ ```
+
+ ### Predict
+
+ The "Predict" command predicts diacritics for input text files, or for folders containing diacritized or un-diacritized text. It generates diacritization predictions using the specified diacritization model and saves the results to the specified output path. Optionally, you can predict text for comparison with Nakdimon using the `-c/--compare` flag.
+
+ To predict diacritics for input text files or folders, use the following command:
+
+ ```bash
+ python main.py predict <input_path> <output_path> [-c/--compare <compare_nakdimon>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
+ ```
+
+ - `<input_path>`: Path to the input file or folder containing text data.
+ - `<output_path>`: Path to the output file where the predicted diacritized text will be saved.
+ - `-c/--compare`: Optional. Set to `True` to predict text for comparison with Nakdimon.
+ - `-ptmp/--pretrain_model_path`: Optional. Path to the pre-trained model weights to be used for prediction. If not provided, the command defaults to our pre-trained D-Nikud model.
+
+ For example, to predict diacritics for a specific input text file and save the results to an output file, you can execute:
+
+ ```bash
+ python main.py predict input.txt output.txt
+ ```
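+
+ As a concrete illustration, a minimal end-to-end run might look as follows (the Hebrew sample and its diacritized form are illustrative; actual output depends on the model):
+
+ ```bash
+ # Write a one-line un-diacritized sample to a file (text is illustrative)
+ echo "שלום עולם" > input.txt
+
+ # Predict with the default pre-trained D-Nikud model
+ python main.py predict input.txt output.txt
+
+ # Inspect the diacritized result, e.g. שָׁלוֹם עוֹלָם
+ cat output.txt
+ ```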
+
+ If you wish to predict text for comparison with Nakdimon and specify a custom pre-trained model path, you can use:
+
+ ```bash
+ python main.py predict input_folder output_folder -c True -ptmp path/to/pretrained/model.pth
+ ```
+
+ Here, the command will predict diacritics for the texts in `input_folder`, generate output files in `output_folder`, and use the specified pre-trained model for prediction.
+
+ You can adapt the paths and options to suit your project's requirements. If the `-ptmp` parameter is omitted, the command automatically uses our default pre-trained D-Nikud model for prediction.
+
+ ### Evaluate
+
+ The "Evaluate" command assesses the performance of the diacritization model by computing accuracy metrics for the individual diacritic elements (nikud, dagesh, and sin), as well as overall letter and word accuracy. The evaluation compares the model's diacritization results against the original diacritized text, providing insight into how accurately the model predicts and applies diacritics.
+
+ To evaluate the diacritization model, you can use the following command:
+
+ ```bash
+ python main.py evaluate <input_path> [-ptmp/--pretrain_model_path <pretrain_model_path>] [-df/--plots_folder <plots_folder>] [-es/--eval_sub_folders]
+ ```
+
+ - `<input_path>`: Path to the input file or folder containing text data for evaluation.
+ - `-ptmp/--pretrain_model_path`: Optional. Path to the pre-trained model weights to be used for evaluation. If not specified, the command defaults to our pre-trained D-Nikud model.
+ - `-df/--plots_folder`: Optional. Path to the folder where evaluation plots will be saved. If not provided, the default plots folder is used.
+ - `-es/--eval_sub_folders`: Optional. Include this flag to compute accuracy separately for each sub-folder within the `input_path` folder (see the second example below).
+
+ For example, to evaluate the diacritization model's performance on a specific dataset, you might run:
+
+ ```bash
+ python main.py evaluate dataset_folder -ptmp path/to/pretrained/model.pth -df evaluation_plots
+ ```
+
+ This command will evaluate the model's accuracy on the dataset found in the `dataset_folder`, using the specified pre-trained model weights and saving evaluation plots in the `evaluation_plots` folder.
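+
+ The `-es` flag takes no value; for instance, to score each sub-folder of a dataset independently with the default model, you might run:
+
+ ```bash
+ # Per-sub-folder accuracy, using the default pre-trained D-Nikud model
+ python main.py evaluate dataset_folder -es
+ ```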
+
+ ### Train
+
+ The "Train" command trains the diacritization model on your own dataset. It supports fine-tuning a pre-trained model, adjusting hyperparameters such as the learning rate and batch size, and specifying various training settings.
+
+ ⚠️ **Important Note:** Any file or folder in the specified data folder whose name contains the string "not_use" or "NakdanResults" is excluded from the training and testing processes. This lets you selectively exclude specific data from training if needed (see the renaming sketch below).
+
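+ For example, to exclude a file without deleting it, rename it so its name contains the excluded string (the file name below is hypothetical):
+
+ ```bash
+ # The renamed file is skipped by both training and testing
+ mv data/train/noisy_corpus.txt data/train/not_use_noisy_corpus.txt
+ ```
+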
+ To train the diacritization model, use the following command:
+
+ ```bash
+ python main.py train [--learning_rate <learning_rate>] [--batch_size <batch_size>]
+                      [--n_epochs <n_epochs>] [--data_folder <data_folder>] [--checkpoints_frequency <checkpoints_frequency>]
+                      [-df/--plots_folder <plots_folder>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
+ ```
+
+ - `--learning_rate`: Optional. Learning rate for training (default: 0.001).
+ - `--batch_size`: Optional. Batch size for training (default: 32).
+ - `--n_epochs`: Optional. Number of training epochs (default: 10).
+ - `--data_folder`: Optional. Path to the folder containing training data (default: "data").
+ - `--checkpoints_frequency`: Optional. How often (in epochs) model checkpoints are saved during training (default: 1).
+ - `-df/--plots_folder`: Optional. Path to the folder where training plots will be saved.
+ - `-ptmp/--pretrain_model_path`: Optional. Path to pre-trained model weights from which to continue training. Use this only if you want to fine-tune a specific pre-trained model.
+
+ ⚠️ **Folder Structure:** The `--data_folder` must have the following structure (a creation sketch follows the list):
+ - **data_folder**
+   - **train**
+     - Contains training data
+   - **dev**
+     - Contains development/validation data
+   - **test**
+     - Contains testing data
+
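+ A quick way to create this layout, assuming the default folder name `data`:
+
+ ```bash
+ # Create the expected train/dev/test layout
+ mkdir -p data/train data/dev data/test
+ ```
+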
+ For instance, to initiate training with a specified learning rate, batch size, and number of epochs, you can execute:
+
+ ```bash
+ python main.py train --learning_rate 0.001 --batch_size 16 --n_epochs 20
+ ```
+
+ If you want to continue training from a pre-trained model and save model checkpoints every 3 epochs, you can use:
+
+ ```bash
+ python main.py train --checkpoints_frequency 3 -ptmp path/to/pretrained/model.pth
+ ```
+
+ In this example, the command resumes training from the pre-trained model located at `path/to/pretrained/model.pth` and saves checkpoints every 3 epochs. Training plots are saved in the default plots folder, since `-df` is not given here.
+
+ Remember to adjust the command options according to your training requirements and preferences. If you don't provide the `-ptmp` parameter, the command will start training from scratch using the default D-Nikud model architecture.
+
+ ## Acknowledgments
+
+ This script utilizes the D-Nikud model developed by [Adi Rosenthal](https://github.com/Adirosenthal540) and [Nadav Shaked](https://github.com/NadavShaked).
+
+ ## License
+
+ This code is provided under the [MIT License](https://www.mit.edu/~amini/LICENSE.md). You are free to use, modify, and distribute the code according to the terms of the license.