HamzaNaser committed on
Commit 2862be7
1 Parent(s): 32c29c7

Update README.md

Files changed (1)
  1. README.md +4 -6
README.md CHANGED
@@ -42,8 +42,7 @@ Dialects-to-MSA-Transformer was Fine-Tuned on m2m100_418M, which consist of ~400
 
 
 # Dataset
-??????? Update the name of the dataset ???????
-The Model was trained on `DialectsGeneration-Long-Finalized` Dataset, which consists of 0.8M of random crowled Arabic Tweets sentences with their corrosponding Classical conversion.
+The Model was trained on `Dialects-To-MSA-800K` Dataset, which consists of random crowled Arabic Tweets sentences with their corrosponding Classical conversion.
 Arabic Tweets are selected randomly from the Arabic-Tweets Datasets https://huggingface.co/datasets/pain/Arabic-Tweets , and Classical Arabic sentences where generated using gpt-4o-mini model by prompting it to convert the given sentences into Corrected Classical Arabic text.
 
 
@@ -102,10 +101,9 @@ Inspecting large paris of texts might be tedious, thus we have taken a sample of
 | 3.0M | A100 | 1 | ??????? | ??????? |
 
 ## Costs and Resources
-??????? Update costs as per exact used after finishing the model ???????
-??????? Adjust GPU costs after finishing up training ???????
-There are two main computing resources when building Dialects to MSA Transformer, one is the generation of MSA sequences using GPT model, the second resource is the GPU used to train and adjust the parameters of the pretrained Model.
-- OpenAI API: Generating the data took around a week with small batches fed into the API due to limited max tokens sizes and due to arabic being tokenized in the char level in the GPT Model, total costs for the API is around 30$.
+??????? Update costs ???????
+There are two main computing resources when Dialects to MSA Transformer were built, one is the generation of MSA sequences using GPT model, the second resource is the GPU used to train and adjust the parameters of the pretrained Model.
+- OpenAI API: Generating the data took around a week with small batches fed into the API due to limited max tokens sizes and due to arabic being tokenized in the char level in the GPT Model, total costs for the API is around 35$.
 - GPU: T4, A100 provided by Google Colab total computing units are ?????? which around 10$
 
 