HamzaNaser committed
Commit 2862be7
Parent(s): 32c29c7
Update README.md

README.md CHANGED
@@ -42,8 +42,7 @@ Dialects-to-MSA-Transformer was Fine-Tuned on m2m100_418M, which consists of ~400
 
 
 # Dataset
-
-The Model was trained on the `DialectsGeneration-Long-Finalized` Dataset, which consists of 0.8M randomly crawled Arabic Tweet sentences with their corresponding Classical conversions.
+The Model was trained on the `Dialects-To-MSA-800K` Dataset, which consists of randomly crawled Arabic Tweet sentences with their corresponding Classical conversions.
 Arabic Tweets are selected randomly from the Arabic-Tweets Dataset https://huggingface.co/datasets/pain/Arabic-Tweets , and the Classical Arabic sentences were generated with the gpt-4o-mini model by prompting it to convert the given sentences into corrected Classical Arabic text.
 
 
@@ -102,10 +101,9 @@ Inspecting large pairs of texts might be tedious, thus we have taken a sample of
 | 3.0M | A100 | 1 | ??????? | ??????? |
 
 ## Costs and Resources
-??????? Update costs
-
-
-- OpenAI API: Generating the data took around a week, with small batches fed into the API due to the limited max token size and Arabic being tokenized at the character level in the GPT model; the total cost for the API is around 30$.
+??????? Update costs ???????
+There are two main computing resources used to build the Dialects to MSA Transformer: one is the generation of MSA sequences using the GPT model, and the second is the GPU used to train and adjust the parameters of the pretrained model.
+- OpenAI API: Generating the data took around a week, with small batches fed into the API due to the limited max token size and Arabic being tokenized at the character level in the GPT model; the total cost for the API is around 35$.
 - GPU: T4, A100 provided by Google Colab; total computing units are ??????, which is around 10$
 
 
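The cost note above mentions feeding the API in small batches because Arabic tokenizes near the character level, so each request must stay under a max-token limit. A minimal sketch of that kind of token-budget batching is below; the function name `make_batches`, the greedy packing strategy, and the one-token-per-character estimate are illustrative assumptions, not the author's actual pipeline.

```python
# Hypothetical sketch: greedily pack sentences into batches whose estimated
# token count stays under a cap, assuming roughly one token per character
# (a crude worst-case estimate for character-level Arabic tokenization).

def make_batches(sentences, max_tokens=2000):
    """Pack sentences in order; start a new batch when the estimated
    token budget (len of text in characters) would be exceeded."""
    batches, current, used = [], [], 0
    for s in sentences:
        cost = len(s)  # per-character token estimate
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(s)
        used += cost
    if current:
        batches.append(current)
    return batches

sentences = ["سلام" * 100] * 10          # ten ~400-character sentences
batches = make_batches(sentences, max_tokens=1000)
# with a 1000-token cap, each batch fits two 400-char sentences (800 <= 1000)
```

Each batch would then be sent as one API request, trading more round trips for staying under the per-request token ceiling.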