wissamantoun committed on
Commit
1ac2d1f
1 Parent(s): e5f5983

Update README.md

Files changed (1)
  1. README.md +144 -142
README.md CHANGED
@@ -1,142 +1,144 @@
1
- ---
2
- language: ar
3
- datasets:
4
- - wikipedia
5
- - OSIAN
6
- - 1.5B Arabic Corpus
7
- - OSCAR Arabic Unshuffled
8
- widget:
9
- - text: "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
10
- - text: "القدس مدينة تاريخية، بناها الكنعانيون في"
11
- - text: "كان يا ما كان في قديم الزمان"
12
- ---
13
-
14
- # Arabic GPT2
15
-
16
- <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/AraGPT2.png" width="100" align="left"/>
17
-
18
- You can find more information in our paper [AraGPT2](https://arxiv.org/abs/2012.15520)
19
-
20
- The code in this repository was used to train all GPT2 variants. The code support training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
21
-
22
- GPT2-base and medium uses the code from the `gpt2` folder and can trains models from the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository.
23
- These models were trained using the `lamb` optimizer and follow the same architecture as `gpt2` and are fully compatible with the `transformers` library.
24
-
25
- GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library, and follow the `grover` architecture. You can use the pytorch classes found in `grover/modeling_gpt2.py` as a direct replacement for classes in the `transformers` library (it should support version `v4.x` from `transformers`).
26
- Both models are trained using the `adafactor` optimizer, since the `adam` and `lamb` optimizer use too much memory causing the model to not even fit 1 batch on a TPU core.
27
-
28
- AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
29
-
30
- # Usage
31
-
32
- ## Testing the model using `transformers`:
33
-
34
- ```python
35
- from transformers import GPT2TokenizerFast, pipeline
36
- #for base and medium
37
- from transformers import GPT2LMHeadModel
38
- #for large and mega
39
- from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
40
-
41
- from arabert.preprocess import ArabertPreprocessor
42
-
43
- MODEL_NAME='aubmindlab/aragpt2-base'
44
- arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
45
-
46
- text=""
47
- text_clean = arabert_prep.preprocess(text)
48
-
49
- model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
50
- tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
51
- generation_pipeline = pipeline("text-generation",model=model,tokenizer=tokenizer)
52
-
53
- #feel free to try different decodinn settings
54
- generation_pipeline(text,
55
- pad_token_id=tokenizer.eos_token_id,
56
- num_beams=10,
57
- max_length=200,
58
- top_p=0.9,
59
- repetition_penalty = 3.0,
60
- no_repeat_ngram_size = 3)[0]['generated_text']
61
- ```
62
- ## Finetunning using `transformers`:
63
-
64
- Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed)
65
-
66
- ## Finetuning using our code with TF 1.15.4:
67
-
68
- Create the Training TFRecords:
69
- ```bash
70
- python create_pretraining_data.py
71
- --input_file=<RAW TEXT FILE with documents/article sperated by an empty line>
72
- --output_file=<OUTPUT TFRecord>
73
- --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
74
- ```
75
-
76
- Finetuning:
77
- ```bash
78
- python3 run_pretraining.py \\r\n --input_file="gs://<GS_BUCKET>/pretraining_data/*" \\r\n --output_dir="gs://<GS_BUCKET>/pretraining_model/" \\r\n --config_file="config/small_hparams.json" \\r\n --batch_size=128 \\r\n --eval_batch_size=8 \\r\n --num_train_steps= \\r\n --num_warmup_steps= \\r\n --learning_rate= \\r\n --save_checkpoints_steps= \\r\n --max_seq_length=1024 \\r\n --max_eval_steps= \\r\n --optimizer="lamb" \\r\n --iterations_per_loop=5000 \\r\n --keep_checkpoint_max=10 \\r\n --use_tpu=True \\r\n --tpu_name=<TPU NAME> \\r\n --do_train=True \\r\n --do_eval=False
79
- ```
80
- # Model Sizes
81
-
82
- Model | Optimizer | Context size | Embedding Size | Num of heads | Num of layers | Model Size / Num of Params |
83
- ---|:---:|:---:|:---:|:---:|:---:|:---:
84
- AraGPT2-base | `lamb` | 1024 | 768 | 12 | 12 | 527MB / 135M |
85
- AraGPT2-medium | `lamb` | 1024 | 1024 | 16 | 24 | 1.38G/370M |
86
- AraGPT2-large | `adafactor` | 1024 | 1280 | 20 | 36 | 2.98GB/792M |
87
- AraGPT2-mega | `adafactor` | 1024 | 1536 | 25 | 48 | 5.5GB/1.46B |
88
-
89
- All models are available in the `HuggingFace` model page under the [aubmindlab](https://huggingface.co/aubmindlab/) name. Checkpoints are available in PyTorch, TF2 and TF1 formats.
90
-
91
- ## Compute
92
-
93
- Model | Hardware | num of examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days)
94
- ---|:---:|:---:|:---:|:---:|:---:
95
- AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5
96
- AraGPT2-medium | TPUv3-8 | 9.7M | 1152 | 85K | 1.5
97
- AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220k | 3
98
- AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9
99
-
100
- # Dataset
101
-
102
- The pretraining data used for the new AraGPT2 model is also used for **AraBERTv2 and AraELECTRA**.
103
-
104
- The dataset consists of 77GB or 200,095,961 lines or 8,655,948,860 words or 82,232,988,358 chars (before applying Farasa Segmentation)
105
-
106
- For the new dataset we added the unshuffled OSCAR corpus after we thoroughly filter it, to the dataset used in AraBERTv1 but without the websites that we previously crawled:
107
- - OSCAR unshuffled and filtered.
108
- - [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
109
- - [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
110
- - [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619)
111
- - Assafir news articles. Huge thank you for Assafir for giving us the data
112
-
113
- # Disclaimer
114
-
115
- The text generated by AraGPT2 is automatically generated by a neural network model trained on a large amount of texts, which does not represent the authors' or their institutes' official attitudes and preferences. The text generated by AraGPT2 should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.
116
-
117
- # If you used this model please cite us as :
118
-
119
- ```
120
- @inproceedings{antoun-etal-2021-aragpt2,
121
- title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
122
- author = "Antoun, Wissam and
123
- Baly, Fady and
124
- Hajj, Hazem",
125
- booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
126
- month = apr,
127
- year = "2021",
128
- address = "Kyiv, Ukraine (Virtual)",
129
- publisher = "Association for Computational Linguistics",
130
- url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
131
- pages = "196--207",
132
- }
133
- ```
134
-
135
- # Acknowledgments
136
- Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
137
-
138
- # Contacts
139
- **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
140
-
141
- **Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
142
-
 
 
 
1
+ ---
2
+ language: ar
3
+ datasets:
4
+ - wikipedia
5
+ - Osian
6
+ - 1.5B-Arabic-Corpus
7
+ - oscar-arabic-unshuffled
8
+ - Assafir(private)
9
+ widget:
10
+ - text: "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
11
+ - text: "القدس مدينة تاريخية، بناها الكنعانيون في"
12
+ - text: "كان يا ما كان في قديم الزمان"
13
+ ---
14
+
15
+ # Arabic GPT2
16
+
17
+ <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/AraGPT2.png" width="100" align="left"/>
18
+
19
+ You can find more information in our paper [AraGPT2](https://arxiv.org/abs/2012.15520)
20
+
21
+ The code in this repository was used to train all GPT2 variants. The code support training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
22
+
23
+ GPT2-base and medium uses the code from the `gpt2` folder and can trains models from the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository.
24
+ These models were trained using the `lamb` optimizer and follow the same architecture as `gpt2` and are fully compatible with the `transformers` library.
25
+
26
+ GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library, and follow the `grover` architecture. You can use the pytorch classes found in `grover/modeling_gpt2.py` as a direct replacement for classes in the `transformers` library (it should support version `v4.x` from `transformers`).
27
+ Both models are trained using the `adafactor` optimizer, since the `adam` and `lamb` optimizer use too much memory causing the model to not even fit 1 batch on a TPU core.
28
+
29
+ AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
30
+
31
+ # Usage
32
+
33
+ ## Testing the model using `transformers`:
34
+
35
+ ```python
36
+ from transformers import GPT2TokenizerFast, pipeline
37
+ #for base and medium
38
+ from transformers import GPT2LMHeadModel
39
+ #for large and mega
40
+ # pip install arabert
41
+ from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
42
+
43
+ from arabert.preprocess import ArabertPreprocessor
44
+
45
+ MODEL_NAME='aubmindlab/aragpt2-base'
46
+ arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
47
+
48
+ text=""
49
+ text_clean = arabert_prep.preprocess(text)
50
+
51
+ model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
52
+ tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
53
+ generation_pipeline = pipeline("text-generation",model=model,tokenizer=tokenizer)
54
+
55
+ #feel free to try different decoding settings
56
+ generation_pipeline(text,
57
+ pad_token_id=tokenizer.eos_token_id,
58
+ num_beams=10,
59
+ max_length=200,
60
+ top_p=0.9,
61
+ repetition_penalty = 3.0,
62
+ no_repeat_ngram_size = 3)[0]['generated_text']
63
+ ```
64
+ ## Finetunning using `transformers`:
65
+
66
+ Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed)
67
+
68
+ ## Finetuning using our code with TF 1.15.4:
69
+
70
+ Create the Training TFRecords:
71
+ ```bash
72
+ python create_pretraining_data.py
73
+ --input_file=<RAW TEXT FILE with documents/article separated by an empty line>
74
+ --output_file=<OUTPUT TFRecord>
75
+ --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
76
+ ```
77
+
78
+ Finetuning:
79
+ ```bash
80
+ python3 run_pretraining.py \\r\n --input_file="gs://<GS_BUCKET>/pretraining_data/*" \\r\n --output_dir="gs://<GS_BUCKET>/pretraining_model/" \\r\n --config_file="config/small_hparams.json" \\r\n --batch_size=128 \\r\n --eval_batch_size=8 \\r\n --num_train_steps= \\r\n --num_warmup_steps= \\r\n --learning_rate= \\r\n --save_checkpoints_steps= \\r\n --max_seq_length=1024 \\r\n --max_eval_steps= \\r\n --optimizer="lamb" \\r\n --iterations_per_loop=5000 \\r\n --keep_checkpoint_max=10 \\r\n --use_tpu=True \\r\n --tpu_name=<TPU NAME> \\r\n --do_train=True \\r\n --do_eval=False
81
+ ```
82
+ # Model Sizes
83
+
84
+ Model | Optimizer | Context size | Embedding Size | Num of heads | Num of layers | Model Size / Num of Params |
85
+ ---|:---:|:---:|:---:|:---:|:---:|:---:
86
+ AraGPT2-base | `lamb` | 1024 | 768 | 12 | 12 | 527MB / 135M |
87
+ AraGPT2-medium | `lamb` | 1024 | 1024 | 16 | 24 | 1.38G/370M |
88
+ AraGPT2-large | `adafactor` | 1024 | 1280 | 20 | 36 | 2.98GB/792M |
89
+ AraGPT2-mega | `adafactor` | 1024 | 1536 | 25 | 48 | 5.5GB/1.46B |
90
+
91
+ All models are available in the `HuggingFace` model page under the [aubmindlab](https://huggingface.co/aubmindlab/) name. Checkpoints are available in PyTorch, TF2 and TF1 formats.
92
+
93
+ ## Compute
94
+
95
+ Model | Hardware | num of examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days)
96
+ ---|:---:|:---:|:---:|:---:|:---:
97
+ AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5
98
+ AraGPT2-medium | TPUv3-8 | 9.7M | 1152 | 85K | 1.5
99
+ AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220k | 3
100
+ AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9
101
+
102
+ # Dataset
103
+
104
+ The pretraining data used for the new AraGPT2 model is also used for **AraBERTv2 and AraELECTRA**.
105
+
106
+ The dataset consists of 77GB or 200,095,961 lines or 8,655,948,860 words or 82,232,988,358 chars (before applying Farasa Segmentation)
107
+
108
+ For the new dataset we added the unshuffled OSCAR corpus after we thoroughly filter it, to the dataset used in AraBERTv1 but without the websites that we previously crawled:
109
+ - OSCAR unshuffled and filtered.
110
+ - [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
111
+ - [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
112
+ - [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619)
113
+ - Assafir news articles. Huge thank you for Assafir for giving us the data
114
+
115
+ # Disclaimer
116
+
117
+ The text generated by AraGPT2 is automatically generated by a neural network model trained on a large amount of texts, which does not represent the authors' or their institutes' official attitudes and preferences. The text generated by AraGPT2 should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.
118
+
119
+ # If you used this model please cite us as :
120
+
121
+ ```
122
+ @inproceedings{antoun-etal-2021-aragpt2,
123
+ title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
124
+ author = "Antoun, Wissam and
125
+ Baly, Fady and
126
+ Hajj, Hazem",
127
+ booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
128
+ month = apr,
129
+ year = "2021",
130
+ address = "Kyiv, Ukraine (Virtual)",
131
+ publisher = "Association for Computational Linguistics",
132
+ url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
133
+ pages = "196--207",
134
+ }
135
+ ```
136
+
137
+ # Acknowledgments
138
+ Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continuous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
139
+
140
+ # Contacts
141
+ **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
142
+
143
+ **Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
144
+