wasmdashai/vits-ar

@@ -9,9 +9,6 @@ pipeline_tag: text-to-speech
 ---
 # Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
 ## Model Details
@@ -19,185 +16,94 @@ This modelcard aims to be a base template for new models. It has been generated
 <!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 # Model Card for Model ID
 ## Model Details
 <!-- Provide a longer summary of what this model is. -->
+An advanced text-to-speech (TTS) system designed specifically for Arabic, built on the VITS architecture and using the pre-trained weights of Facebook's VITS Arabic (ara) model. The model is capable of:
+- **Generating natural and realistic speech:** Producing high-quality Arabic speech that closely mimics human voices, preserving intonation and linguistic nuances.
+- **Understanding colloquial text:** Processing text written in various Arabic dialects, including idiomatic expressions and local vocabulary.
+### Model Architecture
+VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
+A set of spectrogram-based acoustic features is predicted by the flow-based module, which consists of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows it to synthesise speech with different rhythms from the same input text.
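The effect of a stochastic duration predictor can be illustrated with a toy sketch. This is not the VITS implementation: the function `sample_durations`, the base log-durations, and the noise scale are all made-up values chosen for illustration. It only shows how adding random noise to per-token log-durations makes the same input expand to a different number of audio frames on each draw.

```python
import math
import random


def sample_durations(base_log_durations, noise_scale, seed):
    """Toy stand-in for a stochastic duration predictor (not the VITS code)."""
    rng = random.Random(seed)
    durations = []
    for log_d in base_log_durations:
        # Perturb the log-duration with Gaussian noise.
        noisy_log_d = log_d + noise_scale * rng.gauss(0.0, 1.0)
        # Each token gets a whole number of frames, at least one.
        durations.append(max(1, math.ceil(math.exp(noisy_log_d))))
    return durations


# Hypothetical log-durations for a five-token input.
base = [1.2, 0.8, 1.5, 0.6, 1.0]

take_a = sample_durations(base, noise_scale=0.5, seed=0)
take_b = sample_durations(base, noise_scale=0.5, seed=1)
flat = sample_durations(base, noise_scale=0.0, seed=0)

print("take A:", take_a, "total frames:", sum(take_a))
print("take B:", take_b, "total frames:", sum(take_b))
print("no noise:", flat, "total frames:", sum(flat))
```

With the noise scale set to zero the durations are fully determined by the base values, while fixing the seed makes a noisy draw repeatable.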
+## Usage
+MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint,
+first install the latest version of the library:
+```bash
+pip install --upgrade "transformers[torch]"
+```
+Then, run inference with the following code snippet:
+```python
+import torch
+from IPython.display import Audio
+from transformers import AutoTokenizer, VitsModel
+
+# Load the Arabic VITS checkpoint and its tokenizer.
+model = VitsModel.from_pretrained("wasmdashai/vits-ar")
+tokenizer = AutoTokenizer.from_pretrained("wasmdashai/vits-ar")
+
+# Arabic input text to synthesise.
+text = "السلام عليكم ورحمة الله وبركاته ما الجديد؟"
+inputs = tokenizer(text, return_tensors="pt")
+
+# Run inference without tracking gradients.
+with torch.no_grad():
+    full_generation = model(**inputs)
+
+# Flatten the generated waveform into a 1-D NumPy array.
+full_generation_waveform = full_generation.waveform.cpu().numpy().reshape(-1)
+
+# Play the audio in a notebook at the model's sampling rate.
+Audio(full_generation_waveform, rate=model.config.sampling_rate)
+```
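Outside a notebook, the waveform can be written to disk instead of played inline. The following is a minimal sketch using only the Python standard library; the helper name `save_wav`, the output path, and the 16 kHz rate are assumptions for illustration, and a synthetic tone stands in for the model output so the example runs without downloading the checkpoint.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, sample_rate, path):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Assumed rate for this demo; in practice read model.config.sampling_rate.
SAMPLE_RATE = 16000
# Hypothetical output location.
OUTPUT_PATH = os.path.join(tempfile.gettempdir(), "vits_ar_demo.wav")

# Stand-in for the model output: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
save_wav(tone, SAMPLE_RATE, OUTPUT_PATH)
```

With the inference snippet above, the equivalent call would be `save_wav(full_generation_waveform, model.config.sampling_rate, "speech.wav")`.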
+## Contact
+You can also email us at [email protected]
+## Arabic Dialect Speech Generation Models
+### Introduction
+We are pleased to announce the upcoming release of a collection of Arabic dialect speech generation models. These models are designed with advanced AI techniques to deliver natural, realistic text-to-speech output across the various Arabic dialects.
+### Model Table
+| **Dialect** | **Model name** | **Description** | **Expected release date** | **Audio quality level** |
+|-------------|----------------|-----------------|---------------------------|-------------------------|
+| Arabic | [vits-ar](https://huggingface.co/wasmdashai/vits-ar) | Text-to-speech model for the Yemeni dialect with fine detail. | Available | Medium |
+| Yemeni | [vits-ar-ye](https://huggingface.co/wasmdashai/vits-ar-ye) | Text-to-speech model for the Yemeni dialect with fine detail. | Coming soon | Medium |
+| Saudi | [vits-ar-sa](https://huggingface.co/wasmdashai/vits-ar-sa-huba) | Text-to-speech model for the Saudi dialect with high quality and fine detail. | Available | Medium |
+| Egyptian | [vits-ar-eg](https://huggingface.co/wasmdashai/vits-ar-eg) | Text-to-speech model for the Egyptian dialect with a natural, smooth style. | Coming soon | Medium |
+| Lebanese | [vits-ar-lb](https://huggingface.co/wasmdashai/vits-ar-lb) | Model specialized in the Lebanese dialect, generating speech with fine, realistic detail. | Coming soon | Medium |
+| Moroccan | [vits-ar-ma](https://huggingface.co/wasmdashai/vits-ar-ma) | Text-to-speech model for the Moroccan dialect, able to understand local terms. | Coming soon | Medium |
+| Emirati | [vits-ar-ae](https://huggingface.co/wasmdashai/vits-ar-ae) | Text-to-speech model for the Emirati dialect with realism and fine detail. | Coming soon | Medium |
+| Jordanian | [vits-ar-jo](https://huggingface.co/wasmdashai/vits-ar-jo) | Text-to-speech model for the Jordanian dialect with well-rendered phonetic detail. | Coming soon | Medium |
+| Iraqi | [vits-ar-iq](https://huggingface.co/wasmdashai/vits-ar-iq) | Speech generation model for the Iraqi dialect with accurate pronunciation of common words and expressions. | Coming soon | Medium |
+| Syrian | [vits-ar-sy](https://huggingface.co/wasmdashai/vits-ar-sy) | Text-to-speech model for the Syrian dialect with clarity and a natural voice. | Coming soon | Medium |
+| Palestinian | [vits-ar-ps](https://huggingface.co/wasmdashai/vits-ar-ps) | Text-to-speech model for the Palestinian dialect with fine detail. | Coming soon | Medium |
+| Sudanese | [vits-ar-sd](https://huggingface.co/wasmdashai/vits-ar-sd) | Text-to-speech model for the Sudanese dialect with an understanding of local vocabulary. | Coming soon | Medium |
+| Algerian | [vits-ar-dz](https://huggingface.co/wasmdashai/vits-ar-dz) | Text-to-speech model for the Algerian dialect with accuracy and high quality. | Coming soon | Medium |
+| Tunisian | [vits-ar-tn](https://huggingface.co/wasmdashai/vits-ar-tn) | Text-to-speech model for the Tunisian dialect with well-rendered local detail. | Coming soon | Medium |
+| Libyan | [vits-ar-ly](https://huggingface.co/wasmdashai/vits-ar-ly) | Text-to-speech model for the Libyan dialect with accurate, realistic pronunciation. | Coming soon | Medium |
+| Bahraini | [vits-ar-bh](https://huggingface.co/wasmdashai/vits-ar-bh) | Text-to-speech model for the Bahraini dialect with high audio quality. | Coming soon | Medium |
+| Omani | [vits-ar-om](https://huggingface.co/wasmdashai/vits-ar-om) | Text-to-speech model for the Omani dialect with accurate, clear pronunciation. | Coming soon | Medium |
+| Qatari | [vits-ar-qa](https://huggingface.co/wasmdashai/vits-ar-qa) | Text-to-speech model for the Qatari dialect with fine, realistic detail. | Coming soon | Medium |
+| Kuwaiti | [vits-ar-kw](https://huggingface.co/wasmdashai/vits-ar-kw) | Text-to-speech model for the Kuwaiti dialect with high quality and clarity. | Coming soon | Medium |
+| Mauritanian | [vits-ar-mr](https://huggingface.co/wasmdashai/vits-ar-mr) | Text-to-speech model for the Mauritanian dialect with fine, realistic detail. | Coming soon | Medium |
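Since the table marks most dialect checkpoints as not yet released, an application may want to fall back to the base Arabic model when a dialect-specific one is unavailable. The sketch below is one possible way to do that; the short dialect keys (`"ar"`, `"sa"`) and the helper `resolve_model_id` are hypothetical names, and the mapping only includes the two repositories the table lists as available.

```python
BASE_MODEL = "wasmdashai/vits-ar"

# Checkpoints the table lists as available; the short keys are hypothetical.
AVAILABLE_MODELS = {
    "ar": "wasmdashai/vits-ar",
    "sa": "wasmdashai/vits-ar-sa-huba",
}


def resolve_model_id(dialect_code):
    """Return the repo ID for a dialect, falling back to the base Arabic model."""
    return AVAILABLE_MODELS.get(dialect_code.lower(), BASE_MODEL)


print(resolve_model_id("sa"))  # Saudi checkpoint is listed as available
print(resolve_model_id("eg"))  # Egyptian checkpoint is not released, so the base model is used
```

The returned ID can be passed to `VitsModel.from_pretrained` and `AutoTokenizer.from_pretrained` in place of the hard-coded repository name.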
+### Technical Details
+All models are based on the VITS architecture, an end-to-end text-to-speech model that generates realistic audio waveforms from text input. The models use Transformer components to analyse the text and generate speech based on the acoustic characteristics of each dialect.
+### Future Upgrades
+Regular updates will be provided to improve audio quality and the models' understanding of the different dialects. Follow us to learn more about the exact release dates for each model.
+## Acknowledgements
+This implementation is based on [tts-arabic](https://github.com/nipponjo/tts-arabic-pytorch), [VITS](https://github.com/jaywalnut310/vits), [Finetune VITS](https://github.com/ylacombe/finetune-hf-vits) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.