---
license: mit
language:
- ar
metrics:
- f1
pipeline_tag: text-classification
tags:
- code
datasets:
- SinaLab/ArBanking77
---



ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic
======================
ArBanking77 is a Modern Standard Arabic (MSA) and dialectal Arabic corpus for intent detection in the banking domain.
It consists of 31,404 queries in MSA and Palestinian dialect. This repo contains the source code and a sample dataset
to train and evaluate an Arabic intent detection model.


ArBanking77 Corpus
--------
ArBanking77 consists of 31,404 queries (MSA and Palestinian dialect) that were manually Arabized and localized from the
original English Banking77 dataset, which consists of 13,083 queries. Each query is classified into one of 77 classes
(intents), including card arrival, card linking, exchange rate, and automatic top-up. You can find the list of these 77
intents in the `./data/Banking77_intents.csv` file. A neural model based on AraBERT was fine-tuned on the ArBanking77
dataset (F1-score of 92% for MSA and 90% for PAL, the Palestinian dialect).
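
As a minimal sketch, the intent list could be inspected with Python's standard `csv` module. The column layout of `Banking77_intents.csv` is not described here, so the helper below (`load_intent_rows`, a hypothetical name) makes no assumption about headers or column names and simply returns every non-empty row. If the file starts with a header row, the row count will be one higher than the number of intents.

```python
import csv
from pathlib import Path


def load_intent_rows(path):
    """Return every non-empty row of the intents CSV as a list of strings."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle) if any(cell.strip() for cell in row)]


if __name__ == "__main__":
    # Path taken from this README; the file layout itself is an assumption.
    rows = load_intent_rows("./data/Banking77_intents.csv")
    print(f"{len(rows)} rows read")
    for row in rows[:5]:  # peek at the first few entries
        print(row)
```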


Full Corpus Download
--------
Sample data is available in the `data` directory. The entire ArBanking77 corpus is available to download upon request
for academic and commercial use. The augmented data, however, cannot be provided.

[Request to download ArBanking77 (corpus and the model)](https://sina.birzeit.edu/arbanking77/)


Model Download
--------
[SinaLab HuggingFace](https://huggingface.co/SinaLab/ArBanking77)
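
Once the model is available, inference might look like the sketch below. It assumes the `SinaLab/ArBanking77` repository can be loaded with the standard `transformers` sequence-classification classes and that its configuration carries readable intent names in `id2label`; neither is confirmed here. The function names `pick_intent` and `predict_intent` are illustrative only.

```python
def pick_intent(logits, id2label):
    """Map a list of per-class scores to the label with the highest score."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]


def predict_intent(text, model_id="SinaLab/ArBanking77"):
    """Classify one Arabic query; the model is downloaded on first use."""
    # Imported lazily so that pick_intent works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # max_length=128 mirrors the --max_length value used in the training command.
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: config.id2label holds intent names rather than generic LABEL_n.
    return pick_intent(logits, model.config.id2label)


if __name__ == "__main__":
    print(predict_intent("متى ستصل بطاقتي؟"))  # "When will my card arrive?"
```

If the configuration only contains generic labels such as `LABEL_0`, the predicted index would have to be mapped to an intent name through `Banking77_intents.csv` instead.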

Online Demo
--------
You can try our model using this [demo link](https://sina.birzeit.edu/arbanking77/).

Requirements
--------
At this point, the code is compatible with `Python 3.11`.

Clone this repo:

    git clone https://github.com/SinaLab/ArabicNER.git

This package depends on multiple Python packages. It is recommended to use Conda to create a new environment
that mimics the environment the model was trained in. This repo provides a `requirements.txt` file for that purpose.
First, create a new Conda environment using the command below.

    conda create -n env_name python=3.11

Install requirements using pip command:

    pip install -r requirements.txt


Project Structure
--------
```
.
├── data                            <- data dir
│   ├── Banking77_Arabized_MSA_PAL_train_sample.csv
│   ├── Banking77_Arabized_MSA_PAL_val_sample.csv
│   ├── Banking77_Arabized_MSA_test_sample.csv
│   ├── Banking77_Arabized_PAL_test_sample.csv
│   ├── Banking77_intents.csv
├── outputs
│   ├── models                      <- trained models
│   ├── results                     <- evaluation results and reports
├── src                             <- training and evaluation scripts
│   ├── run_glue_no_trainer.py
│   ├── run_glue_no_trainer_eval.py
│   └── utils.py
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
```

Model Training
--------
You can start model training by running the following command. It's recommended to pass the arguments demonstrated below
to get results similar to the ones reported in the paper.

    python ./src/run_glue_no_trainer.py \
        --model_name_or_path aubmindlab/bert-base-arabertv02 \
        --train_file ./data/Banking77_Arabized_MSA_PAL_train_sample.csv \
        --validation_file ./data/Banking77_Arabized_MSA_PAL_val_sample.csv \
        --seed 42 \
        --max_length 128 \
        --learning_rate 4e-5 \
        --num_train_epochs 20 \
        --per_device_train_batch_size 64 \
        --output_dir ./outputs/models

Evaluation
--------
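Additionally, you can evaluate the trained model on the `Banking77_Arabized_MSA_test_sample.csv` and `Banking77_Arabized_PAL_test_sample.csv` test sets. The command below uses the MSA test set; pass the PAL file to `--validation_file` to evaluate on the Palestinian dialect.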

    python ./src/run_glue_no_trainer_eval.py \
        --model_name_or_path ./outputs/models \
        --validation_file ./data/Banking77_Arabized_MSA_test_sample.csv \
        --seed 42 \
        --per_device_eval_batch_size 64 \
        --results_dir ./outputs/results \
        --log_path ./outputs/logs/log.txt
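
To evaluate on both test sets in one go, the evaluation command could be wrapped in a small Python driver such as the sketch below. The flags are the ones shown in this README; writing each run to its own results directory and log file is an assumption made here to keep the two runs from overwriting each other, and `build_eval_command` is a hypothetical helper name.

```python
import subprocess
import sys
from pathlib import Path

TEST_FILES = {
    "MSA": "./data/Banking77_Arabized_MSA_test_sample.csv",
    "PAL": "./data/Banking77_Arabized_PAL_test_sample.csv",
}


def build_eval_command(name, test_file):
    """Assemble the evaluation command for one test set as an argument list."""
    return [
        sys.executable, "./src/run_glue_no_trainer_eval.py",
        "--model_name_or_path", "./outputs/models",
        "--validation_file", test_file,
        "--seed", "42",
        "--per_device_eval_batch_size", "64",
        # Assumption: one results directory and one log file per test set.
        "--results_dir", f"./outputs/results/{name}",
        "--log_path", f"./outputs/logs/log_{name}.txt",
    ]


if __name__ == "__main__":
    Path("./outputs/logs").mkdir(parents=True, exist_ok=True)
    for name, test_file in TEST_FILES.items():
        # Create the output directory up front in case the script does not.
        Path(f"./outputs/results/{name}").mkdir(parents=True, exist_ok=True)
        subprocess.run(build_eval_command(name, test_file), check=True)
```

Passing the command as an argument list avoids shell quoting issues, and `check=True` stops the loop if either evaluation run fails.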

Credits
-------
This research was funded by the Palestinian Higher Council for Innovation and Excellence and the Scientific and
Technological Research Council of Türkiye (TÜBİTAK) under project No. 120N761 - CONVERSER: Conversational AI System for
Arabic.


Citation
-------
Mustafa Jarrar, Ahmet Birim, Mohammed Khalilia, Mustafa Erden, and Sana
Ghanem: [ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic](http://www.jarrar.info/publications/JBKEG23.pdf).
In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.

https://arxiv.org/abs/2310.19034