---
language:
- he
tags:
- roberta
- Language model
pipeline_tag: text-generation
---

# D-Nikud

Welcome to the main code repository of the D-Nikud diacritization model. D-Nikud uses the TavBERT architecture and a Bi-LSTM to predict and apply diacritics (nikud) to Hebrew text. Because diacritics determine how a Hebrew word is pronounced and interpreted, restoring them improves the quality of downstream Hebrew text analysis.

The code provided here encompasses various functionalities, including prediction, evaluation, and training of the D-Nikud diacritization model. 

Repository for the paper [D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models](https://arxiv.org/abs/2402.00075) by Nadav Shaked and Adi Rosenthal.

## Prerequisites

Before running the script, make sure you have the following installed:

- Python (tested with Python 3.10)
- `torch` library (PyTorch)
- `transformers` library
- Required Python packages (Install using `pip install -r requirements.txt`)

## Table of Contents
- [Prerequisites](#prerequisites)
- [Introduction](#introduction)
- [Pre-Trained model](#pre-trained-model)
- [Usage](#usage)
  - [Predict](#predict)
  - [Evaluate](#evaluate)
  - [Train](#train)
- [Acknowledgments](#acknowledgments)
- [License](#license)

## Introduction

Our D-Nikud model uses the TavBERT architecture and a Bi-LSTM for diacritization (nikud) of Hebrew text. Diacritics are essential for accurate pronunciation and interpretation of the text. This repository provides the core code for implementing and using the D-Nikud model.

## Pre-Trained model

Our pre-trained D-Nikud model is available from [this Google Drive folder](https://drive.google.com/drive/folders/1osK503txvsEWlZASBViSqOiNJMzAlD0F). To use it, unzip the downloaded file and copy its contents to the `models` folder.
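
The unzip-and-copy step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `install_pretrained_model` is hypothetical, and it assumes the download is a regular `.zip` archive and that the script runs from the repository root.

```python
import zipfile
from pathlib import Path


def install_pretrained_model(archive_path, models_dir="models"):
    """Extract a downloaded model archive into the models folder.

    Hypothetical helper: returns the sorted names of the files that
    were extracted, relative to models_dir.
    """
    target = Path(models_dir)
    target.mkdir(parents=True, exist_ok=True)  # create 'models' if missing
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
        # Directory entries end with "/"; report only real files.
        return sorted(n for n in zf.namelist() if not n.endswith("/"))
```

Call it with the path of the archive you downloaded, for example `install_pretrained_model("path/to/downloaded.zip")`, where the path is a placeholder for your own file.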

## Usage

Clone the repository:

```bash
git clone https://github.com/NadavShaked/D_Nikud.git
cd D_Nikud
```

Fetch the D-Nikud data submodule:

```bash
git submodule update --init --recursive
```


### Predict

The "Predict" command enables the prediction of diacritics for input text files or folders containing diacritized or un-diacritized text. It generates diacritization predictions using the specified diacritization model and saves the results to the specified output file. Optionally, you can choose to predict text for comparison with Nakdimon using the `-c/--compare` flag.
To predict diacritics for input text files or folders, use the following command:

```bash
python main.py predict <input_path> <output_path> [-c/--compare <compare_nakdimon>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
```

- `<input_path>`: Path to the input file or folder containing text data.
- `<output_path>`: Path to the output file where the predicted diacritized text will be saved.
- `-c/--compare`: Optional. Set to `True` to predict text for comparison with Nakdimon.
- `-ptmp/--pretrain_model_path`: Optional. Path to the pre-trained model weights to be used for prediction. If not provided, the command will default to using our pre-trained D-Nikud model.

For example, to predict diacritics for a specific input text file and save the results to an output file, you can execute:

```bash
python main.py predict input.txt output.txt
```

If you wish to predict text for comparison with Nakdimon and specify a custom pre-trained model path, you can use:

```bash
python main.py predict input_folder output_folder -c True -ptmp path/to/pretrained/model.pth
```

Here, the command will predict diacritics for the texts in the `input_folder`, generate output files in the `output_folder`, and use the specified pre-trained model for prediction.

Adapt the paths and options to your project. If `-ptmp` is omitted, the command uses our default pre-trained D-Nikud model.
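
If you drive predictions from another Python script, one option is to assemble the command line programmatically. The sketch below is an illustration only: `build_predict_command` and `run_predict` are hypothetical helpers, they use just the arguments documented above, and they assume the working directory is the repository root so that `main.py` can be found.

```python
import subprocess
import sys


def build_predict_command(input_path, output_path, compare=False, model_path=None):
    """Assemble the argument list for `python main.py predict ...`."""
    cmd = [sys.executable, "main.py", "predict", str(input_path), str(output_path)]
    if compare:
        # -c/--compare: predict text for comparison with Nakdimon
        cmd += ["-c", "True"]
    if model_path is not None:
        # -ptmp/--pretrain_model_path: custom model weights
        cmd += ["-ptmp", str(model_path)]
    return cmd


def run_predict(input_path, output_path, compare=False, model_path=None):
    """Run the predict command; raises CalledProcessError if it fails."""
    cmd = build_predict_command(input_path, output_path, compare, model_path)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.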

### Evaluate

The "Evaluate" command assesses the performance of the diacritization model by computing accuracy metrics for specific diacritic elements (nikud, dagesh, and sin) as well as overall letter and word accuracy. The evaluation compares the model's predictions with the original diacritized text, which shows how accurately the model predicts and applies diacritics.

To evaluate the diacritization model, you can use the following command:

```bash
python main.py evaluate <input_path> [-ptmp/--pretrain_model_path <pretrain_model_path>] [-df/--plots_folder <plots_folder>] [-es/--eval_sub_folders]
```

- `<input_path>`: Path to the input file or folder containing text data for evaluation.
- `-ptmp/--pretrain_model_path`: Optional. Path to the pre-trained model weights to be employed for evaluation. If this parameter is not specified, the command will default to using our pre-trained D-Nikud model.
- `-df/--plots_folder`: Optional. Path to the folder where evaluation plots will be saved. If not provided, the default plots folder will be used.
- `-es/--eval_sub_folders`: Optional. Include this flag to enable accuracy calculation for sub-folders within the `input_path` folder, providing independent assessments for each subfolder.

For example, to evaluate the diacritization model's performance on a specific dataset, you might run:

```bash
python main.py evaluate dataset_folder -ptmp path/to/pretrained/model.pth -df evaluation_plots
```

This command will evaluate the model's accuracy on the dataset found in the `dataset_folder`, using the specified pre-trained model weights and saving evaluation plots in the `evaluation_plots` folder.
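
To make the metrics concrete, here is a rough sketch of how letter and word accuracy could be computed from a gold text and a predicted text. It is not the repository's evaluation code: the function names, the grouping of Unicode marks into `NIKUD`, `DAGESH`, and `SIN`, and the choice to score every Hebrew letter are assumptions made for illustration.

```python
import unicodedata

# Assumed grouping of Hebrew combining marks (Unicode code points).
NIKUD = {chr(c) for c in range(0x05B0, 0x05BC)} | {"\u05c7"}  # vowel points
DAGESH = {"\u05bc"}
SIN = {"\u05c1", "\u05c2"}  # shin dot and sin dot


def split_marks(text):
    """Pair each base character with the set of combining marks after it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            units[-1][1].add(ch)
        else:
            units.append((ch, set()))
    return units


def _hebrew_letters(text):
    return [u for u in split_marks(text) if "\u05d0" <= u[0] <= "\u05ea"]


def letter_accuracy(gold, predicted, marks=None):
    """Fraction of Hebrew letters whose marks match the gold text.

    If `marks` is given (e.g. DAGESH), only those marks are compared.
    """
    gold_units = _hebrew_letters(gold)
    pred_units = _hebrew_letters(predicted)
    if [g[0] for g in gold_units] != [p[0] for p in pred_units]:
        raise ValueError("gold and predicted text differ in their base letters")
    if not gold_units:
        raise ValueError("no Hebrew letters to score")

    def keep(found):
        return found if marks is None else found & marks

    correct = sum(keep(g[1]) == keep(p[1]) for g, p in zip(gold_units, pred_units))
    return correct / len(gold_units)


def word_accuracy(gold, predicted):
    """Fraction of whitespace-separated words that are fully correct."""
    gold_words, pred_words = gold.split(), predicted.split()
    if len(gold_words) != len(pred_words) or not gold_words:
        raise ValueError("texts must contain the same non-zero number of words")
    correct = sum(split_marks(g) == split_marks(p)
                  for g, p in zip(gold_words, pred_words))
    return correct / len(gold_words)
```

Marks are compared as sets, so two spellings that differ only in the order of the marks on a letter count as equal. A single wrong vowel makes the whole word wrong, which is why word accuracy is normally lower than letter accuracy.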

### Train

The "Train" command enables the training of the diacritization model using your own dataset. This command supports fine-tuning a pre-trained model, adjusting hyperparameters such as learning rate and batch size, and specifying various training settings.

⚠️ **Important Note:** Any file or folder in the specified data folder that contains the string "not_use" or "NakdanResults" in its name will be excluded from the training and testing processes. This feature allows you to selectively exclude specific data from the training process if needed.
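
To preview which files this rule would leave in place before you start a run, a small check like the one below can help. It is a sketch under assumptions, not the repository's loader: the helper names are hypothetical, and it assumes the match is a case-sensitive substring test on every file and folder name below the data folder.

```python
from pathlib import Path

# Substrings named in the note above; matching is assumed case-sensitive.
EXCLUDED_SUBSTRINGS = ("not_use", "NakdanResults")


def is_excluded(path, root):
    """True if any file or folder name below root contains an excluded substring."""
    relative = Path(path).relative_to(root)
    return any(s in part for part in relative.parts for s in EXCLUDED_SUBSTRINGS)


def collect_data_files(root):
    """Return all files under root that the exclusion rule would keep."""
    root = Path(root)
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and not is_excluded(p, root))
```

Renaming a folder to include `not_use` is then enough to drop everything inside it from this listing.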

To train the diacritization model, use the following command:

```bash
python main.py train [--learning_rate <learning_rate>] [--batch_size <batch_size>]
                    [--n_epochs <n_epochs>] [--data_folder <data_folder>] [--checkpoints_frequency <checkpoints_frequency>]
                    [-df/--plots_folder <plots_folder>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
```

- `--learning_rate`: Optional. Learning rate for training (default is 0.001).
- `--batch_size`: Optional. Batch size for training (default is 32).
- `--n_epochs`: Optional. Number of training epochs (default is 10).
- `--data_folder`: Optional. Path to the folder containing training data (default is "data").
- `--checkpoints_frequency`: Optional. Frequency of saving model checkpoints during training (default is 1).
- `-df/--plots_folder`: Optional. Path to the folder where training plots will be saved.
- `-ptmp/--pretrain_model_path`: Optional. Path to the pre-trained model weights to be used for training continuation. Use this only if you want to fine-tune a specific pre-trained model.

⚠️ **Folder Structure:** The `--data_folder` must have the following structure:
- **data_folder**
  - **train**
    - Contains training data
  - **dev**
    - Contains development/validation data
  - **test**
    - Contains testing data
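
A quick check of this layout before training can save a failed run. The following is a minimal sketch with a hypothetical helper, `missing_splits`, which only verifies that the three sub-folders exist and says nothing about their contents.

```python
from pathlib import Path

# Sub-folders required by the structure described above.
REQUIRED_SPLITS = ("train", "dev", "test")


def missing_splits(data_folder):
    """Return the names of required sub-folders that are absent."""
    root = Path(data_folder)
    return [name for name in REQUIRED_SPLITS if not (root / name).is_dir()]
```

An empty list means the folder has the expected structure; for example, `missing_splits("data")` checks the default data folder.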

For instance, to initiate training with a specified learning rate, batch size, and number of epochs, you can execute:

```bash
python main.py train --learning_rate 0.001 --batch_size 16 --n_epochs 20
```

If you want to continue training from a pre-trained model and save model checkpoints every 3 epochs, you can use:

```bash
python main.py train --checkpoints_frequency 3 -ptmp path/to/pretrained/model.pth
```

In this example, the command resumes training from the pre-trained model located at `path/to/pretrained/model.pth` and saves a checkpoint every 3 epochs. To choose where training plots are saved, add the `-df` option.

Remember to adjust the command options according to your training requirements and preferences. If you don't provide the `-ptmp` parameter, the command will start training from scratch using the default D-Nikud model architecture.

## Acknowledgments

This script utilizes the D-Nikud model developed by [Adi Rosenthal](https://github.com/Adirosenthal540) and [Nadav Shaked](https://github.com/NadavShaked).

## License

This code is provided under the [MIT License](https://www.mit.edu/~amini/LICENSE.md). You are free to use, modify, and distribute the code according to the terms of the license.