wissamantoun committed
Commit c7ccbce · 1 Parent(s): 9d991bf
Update README.md

README.md CHANGED
@@ -1,10 +1,11 @@
 ---
 language: ar
 datasets:
-
-
-
-
+- wikipedia
+- Osian
+- 1.5B-Arabic-Corpus
+- oscar-arabic-unshuffled
+- Assafir(private)
 widget:
 - text: "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
 - text: "القدس مدينة تاريخية، بناها الكنعانيون في"
@@ -36,6 +37,7 @@ from transformers import GPT2TokenizerFast, pipeline
 #for base and medium
 from transformers import GPT2LMHeadModel
 #for large and mega
+# pip install arabert
 from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
 
 from arabert.preprocess import ArabertPreprocessor
@@ -50,7 +52,7 @@ model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
 tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
 generation_pipeline = pipeline("text-generation",model=model,tokenizer=tokenizer)
 
-#feel free to try different
+#feel free to try different decoding settings
 generation_pipeline(text,
     pad_token_id=tokenizer.eos_token_id,
     num_beams=10,
@@ -68,7 +70,7 @@ Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-o
 Create the Training TFRecords:
 ```bash
 python create_pretraining_data.py
- --input_file=<RAW TEXT FILE with documents/article
+ --input_file=<RAW TEXT FILE with documents/article separated by an empty line>
  --output_file=<OUTPUT TFRecord>
  --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
 ```
@@ -133,7 +135,7 @@ The text generated by AraGPT2 is automatically generated by a neural network mod
 ```
 
 # Acknowledgments
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continuous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
 
 # Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <[email protected]> | <[email protected]>
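For reference, the usage snippets this commit touches fit together roughly as below. This is a minimal sketch, not the card's verbatim code: the `aubmindlab/aragpt2-base` checkpoint name, the reuse of the widget text as a prompt, and every decoding value other than `num_beams=10` and `pad_token_id` are assumptions added for illustration.

```python
# Minimal sketch of the quickstart this commit documents (assumptions noted inline).
# pip install transformers arabert
from transformers import GPT2TokenizerFast, pipeline
from transformers import GPT2LMHeadModel  # base and medium
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel  # large and mega
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = "aubmindlab/aragpt2-base"  # assumed checkpoint name; pick the size you need

# Preprocess the prompt the same way the training corpus was cleaned
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
text = "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Feel free to try different decoding settings; only num_beams and pad_token_id
# come from the diff above, the rest are illustrative defaults.
output = generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    no_repeat_ngram_size=3,
)[0]["generated_text"]
print(output)
```

For the large and mega variants, swap in the Grover-based `GPT2LMHeadModel` import added to the card (after `pip install arabert`); the rest of the pipeline stays the same.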
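The pretraining hunk only spells out the expected `--input_file` layout: one document/article per block, blocks separated by an empty line. A small sketch of producing such a file follows; the file names, the corpus content, and the tokenizer path are placeholders, and the trailing command mirrors the one in the diff.

```python
# Hypothetical prep step: write raw articles in the layout create_pretraining_data.py
# expects -- one document per block, blocks separated by an empty line.
articles = [
    "القدس مدينة تاريخية، بناها الكنعانيون في ...",
    "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره ...",
]

with open("corpus_raw.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(articles) + "\n")

# Then (paths are placeholders):
# python create_pretraining_data.py \
#   --input_file=corpus_raw.txt \
#   --output_file=corpus.tfrecord \
#   --tokenizer_dir=gpt2_tokenizer/
```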