	---
	license: mit
	language:
	- ar
	metrics:
	- f1
	pipeline_tag: text-classification
	tags:
	- code
	datasets:
	- SinaLab/ArBanking77
	---



	ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic
	======================
	ArBanking77 is a Modern Standard Arabic (MSA) and dialectal Arabic corpus for intent detection in the banking domain.
	It consists of 31,404 samples in MSA and Palestinian dialect. This repo contains the source code and a sample dataset
	for training and evaluating an Arabic intent detection model.


	ArBanking77 Corpus
	--------
	ArBanking77 consists of 31,404 queries in MSA and Palestinian dialect (PAL) that were manually Arabized and localized
	from the original English Banking77 dataset, which consists of 13,083 queries. Each query is classified into one of
	77 classes (intents), including card arrival, card linking, exchange rate, and automatic top-up. The full list of the
	77 intents is in the `./data/Banking77_intents.csv` file. A neural model based on AraBERT was fine-tuned on the
	ArBanking77 dataset (F1-score: 92% for MSA, 90% for PAL).
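A classifier over 77 intents needs a stable mapping between intent names and integer class ids. The sketch below shows one way to build that mapping from the intents file. The column layout of `Banking77_intents.csv` is not described here, so the defaults (a header row, intent name in the first column) are assumptions, as are the helper names `load_intents` and `build_label_maps`.

```python
import csv
from pathlib import Path


def load_intents(csv_path, column=0, has_header=True):
    """Read intent names from one column of a CSV file, in file order.

    Assumption: the file has a header row and the intent name sits in
    the first column. Adjust `column` / `has_header` if it does not.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    if has_header:
        rows = rows[1:]
    return [row[column].strip() for row in rows]


def build_label_maps(intents):
    """Map each distinct intent name to an integer id, and back."""
    unique = sorted(set(intents))  # sorted so the ids are reproducible
    label2id = {name: idx for idx, name in enumerate(unique)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


# Usage (path taken from the text above):
# intents = load_intents("./data/Banking77_intents.csv")
# label2id, id2label = build_label_maps(intents)
# print(len(label2id), "intents loaded")
```

Sorting the names before assigning ids keeps the mapping identical across runs, which matters when predictions from a saved model are decoded later.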


	Full Corpus Download
	--------
	Sample data is available in the `data` directory. The entire ArBanking77 corpus is available to download upon
	request for academic and commercial use; however, we cannot provide the augmented data.

	[Request to download ArBanking77 (corpus and the model)](https://sina.birzeit.edu/arbanking77/)


	Model Download
	--------
	[SinaLab HuggingFace](https://huggingface.co/SinaLab/ArBanking77)

	Online Demo
	--------
	You can try our model using this [demo link](https://sina.birzeit.edu/arbanking77/).
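For local experiments, the model can probably also be queried through the `transformers` text-classification pipeline. The sketch below assumes that the `SinaLab/ArBanking77` Hub repository (the id is taken from the model link above) is accessible and stores the weights and tokenizer in the standard format; this has not been verified here. The helper names and the example sentence are illustrative only.

```python
def load_classifier(model_id="SinaLab/ArBanking77"):
    """Create a text-classification pipeline (downloads the model on first use).

    Assumption: the Hub repository loads with the standard pipeline API.
    """
    # Imported lazily so best_intent() works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def best_intent(predictions):
    """Return (label, score) of the highest-scoring prediction.

    `predictions` is a list of {"label": ..., "score": ...} dicts, the
    shape the pipeline returns for one input when `top_k` is given.
    """
    top = max(predictions, key=lambda item: item["score"])
    return top["label"], top["score"]


# Usage (needs network access and the transformers package):
# classifier = load_classifier()
# predictions = classifier("أريد معرفة سعر الصرف", top_k=3)  # "I want to know the exchange rate"
# print(best_intent(predictions))
```

Requesting the top few intents instead of only the best one makes it easier to see which intents the model confuses on a given query.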

	Requirements
	--------
	The code is currently compatible with Python 3.11.

	Clone this repo:

	    git clone https://github.com/SinaLab/ArabicNER.git

	This package depends on several Python packages. We recommend using Conda to create a new environment that mirrors
	the one the model was trained in. The dependencies are listed in the `requirements.txt` file provided in this repo.
	First create a new Conda environment using the command below.

	    conda create -n env_name python=3.11

	Activate the environment, then install the requirements using pip:

	    conda activate env_name
	    pip install -r requirements.txt


	Project Structure
	--------
	```
	.
	├── data <- data dir
	│   ├── Banking77_Arabized_MSA_PAL_train_sample.csv
	│   ├── Banking77_Arabized_MSA_PAL_val_sample.csv
	│   ├── Banking77_Arabized_MSA_test_sample.csv
	│   ├── Banking77_Arabized_PAL_test_sample.csv
	│   └── Banking77_intents.csv
	├── outputs
	│   ├── models <- trained models
	│   └── results <- evaluation results and reports
	├── src <- training and evaluation scripts
	│   ├── run_glue_no_trainer.py
	│   ├── run_glue_no_trainer_eval.py
	│   └── utils.py
	├── .gitignore
	├── LICENSE
	├── README.md
	└── requirements.txt
	```

	Model Training
	--------
	You can start model training by running the following command. It's recommended to pass the arguments demonstrated below
	to get results similar to the ones reported in the paper.

	    python ./src/run_glue_no_trainer.py \
	        --model_name_or_path aubmindlab/bert-base-arabertv02 \
	        --train_file ./data/Banking77_Arabized_MSA_PAL_train_sample.csv \
	        --validation_file ./data/Banking77_Arabized_MSA_PAL_val_sample.csv \
	        --seed 42 \
	        --max_length 128 \
	        --learning_rate 4e-5 \
	        --num_train_epochs 20 \
	        --per_device_train_batch_size 64 \
	        --output_dir ./outputs/models
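If you want to launch several training runs (for example, to compare learning rates), the command above can be assembled in Python. This is only a sketch: the option names and values are copied from the command above, while the helpers `build_command`, `train_options`, and `run_training` and the per-run output directories are hypothetical.

```python
import subprocess
import sys


def build_command(script, options):
    """Turn a dict of options into an argv list: {"seed": 42} -> ["--seed", "42"]."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def train_options(learning_rate, output_dir):
    """Options from the training command, with two of them made variable."""
    return {
        "model_name_or_path": "aubmindlab/bert-base-arabertv02",
        "train_file": "./data/Banking77_Arabized_MSA_PAL_train_sample.csv",
        "validation_file": "./data/Banking77_Arabized_MSA_PAL_val_sample.csv",
        "seed": 42,
        "max_length": 128,
        "learning_rate": learning_rate,
        "num_train_epochs": 20,
        "per_device_train_batch_size": 64,
        "output_dir": output_dir,
    }


def run_training(options, script="./src/run_glue_no_trainer.py"):
    """Run one training job and raise if the script exits with an error."""
    subprocess.run(build_command(script, options), check=True)


# Usage: one run per learning rate, each in its own (hypothetical) directory.
# for lr in ("2e-5", "4e-5"):
#     run_training(train_options(lr, f"./outputs/models/lr_{lr}"))
```

Passing the arguments as a list avoids shell quoting issues and keeps each run's settings in one place.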

	Evaluation
	--------
	You can evaluate the trained model on the `Banking77_Arabized_MSA_test_sample.csv` and `Banking77_Arabized_PAL_test_sample.csv` test sets as follows. The command below uses the MSA test set; pass the PAL file to `--validation_file` to evaluate on PAL.

	    python ./src/run_glue_no_trainer_eval.py \
	        --model_name_or_path ./outputs/models \
	        --validation_file ./data/Banking77_Arabized_MSA_test_sample.csv \
	        --seed 42 \
	        --per_device_eval_batch_size 64 \
	        --results_dir ./outputs/results \
	        --log_path ./outputs/logs/log.txt
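To make the reported metric concrete, here is a minimal sketch of an F1 computation over gold and predicted intent labels. This README does not state which averaging (macro, micro, or weighted) is behind the reported scores, so the macro average used here is an assumption. The sketch also does not read the evaluation script's output files, whose format is not documented here; it only takes two lists of labels.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """Return {label: F1} for every label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # p was predicted but is wrong
            false_neg[g] += 1  # g was the right answer but was missed
    scores = {}
    for label in set(gold) | set(predicted):
        # F1 = 2*TP / (2*TP + FP + FN)
        denominator = 2 * true_pos[label] + false_pos[label] + false_neg[label]
        scores[label] = 2 * true_pos[label] / denominator if denominator else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, predicted)
    return sum(scores.values()) / len(scores)


# Usage with toy labels:
# gold = ["card_arrival", "card_arrival", "exchange_rate"]
# pred = ["card_arrival", "exchange_rate", "exchange_rate"]
# print(macro_f1(gold, pred))
```

Macro averaging weights all intents equally, so rare intents count as much as frequent ones; a weighted or micro average would give different numbers on an imbalanced test set.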

	Credits
	-------
	This research was funded by the Palestinian Higher Council for Innovation and Excellence and the Scientific and
	Technological Research Council of Türkiye (TÜBİTAK) under project No. 120N761 - CONVERSER: Conversational AI System for
	Arabic.


	Citation
	-------
	Mustafa Jarrar, Ahmet Birim, Mohammed Khalilia, Mustafa Erden, and Sana
	Ghanem: [ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic](http://www.jarrar.info/publications/JBKEG23.pdf).
	In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.

	https://arxiv.org/abs/2310.19034