HamzaNaser committed
Commit 2862be7
Parent(s): 32c29c7
Update README.md

README.md CHANGED
@@ -42,8 +42,7 @@ Dialects-to-MSA-Transformer was Fine-Tuned on m2m100_418M, which consists of ~400
 
 
 # Dataset
-
-The Model was trained on the `DialectsGeneration-Long-Finalized` Dataset, which consists of 0.8M randomly crawled Arabic Tweet sentences with their corresponding Classical conversions.
+The Model was trained on the `Dialects-To-MSA-800K` Dataset, which consists of randomly crawled Arabic Tweet sentences with their corresponding Classical conversions.
 Arabic Tweets are selected randomly from the Arabic-Tweets Dataset https://huggingface.co/datasets/pain/Arabic-Tweets , and the Classical Arabic sentences were generated with the gpt-4o-mini model by prompting it to convert the given sentences into corrected Classical Arabic text.
 
 
@@ -102,10 +101,9 @@ Inspecting large pairs of texts might be tedious, thus we have taken a sample of
 | 3.0M | A100 | 1 | ??????? | ??????? |
 
 ## Costs and Resources
-??????? Update costs
-
-
-- OpenAI API: Generating the data took around a week, with small batches fed into the API due to the limited max token size and Arabic being tokenized at the character level in the GPT model; the total cost for the API is around 30$.
+??????? Update costs ???????
+There are two main computing resources used to build the Dialects to MSA Transformer: one is the generation of MSA sequences using the GPT model, and the second is the GPU used to train and adjust the parameters of the pretrained model.
+- OpenAI API: Generating the data took around a week, with small batches fed into the API due to the limited max token size and Arabic being tokenized at the character level in the GPT model; the total cost for the API is around 35$.
 - GPU: T4, A100 provided by Google Colab; total computing units are ??????, which is around 10$
 
 
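The cost note above mentions feeding the API in small batches because Arabic tokenizes near the character level, so each request must stay under a max-token limit. A minimal sketch of that kind of token-budget batching is below; the function name `make_batches`, the greedy packing strategy, and the one-token-per-character estimate are illustrative assumptions, not the author's actual pipeline.

```python
# Hypothetical sketch: greedily pack sentences into batches whose estimated
# token count stays under a cap, assuming roughly one token per character
# (a crude worst-case estimate for character-level Arabic tokenization).

def make_batches(sentences, max_tokens=2000):
    """Pack sentences in order; start a new batch when the estimated
    token budget (len of text in characters) would be exceeded."""
    batches, current, used = [], [], 0
    for s in sentences:
        cost = len(s)  # per-character token estimate
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(s)
        used += cost
    if current:
        batches.append(current)
    return batches

sentences = ["سلام" * 100] * 10          # ten ~400-character sentences
batches = make_batches(sentences, max_tokens=1000)
# with a 1000-token cap, each batch fits two 400-char sentences (800 <= 1000)
```

Each batch would then be sent as one API request, trading more round trips for staying under the per-request token ceiling.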