Benjamin-png committed
Commit bb5aebf
1 Parent(s): fec23e8

Update README.md

Files changed (1): README.md (+70, -5)
README.md CHANGED

# Swahili MMS TTS - Finetuned Model

## How to Use

You can load and use the model directly from the Hugging Face model hub, either with the `pipeline` API or by downloading the model and tokenizer manually.

### 1. Using the `pipeline` API

```python
from transformers import pipeline

# Load the fine-tuned Swahili TTS model from the Hugging Face Hub
tts = pipeline("text-to-speech", model="Benjamin-png/swahili-mms-tts-finetuned")

# Synthesize speech from Swahili text
speech = tts("Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili.")
```
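
The pipeline returns the generated waveform together with its sampling rate. To save that output to disk, here is a minimal sketch, assuming the `audio` and `sampling_rate` keys returned by the transformers text-to-speech pipeline:

```python
import numpy as np
import scipy.io.wavfile

# `speech` is assumed to be the dict returned by the pipeline above,
# holding the generated waveform and its sampling rate
scipy.io.wavfile.write(
    "swahili_speech.wav",
    rate=speech["sampling_rate"],
    data=np.squeeze(speech["audio"]),  # drop any extra channel dimension
)
```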

### 2. Download and Run the Model Directly

You can also download the model and tokenizer manually and run text-to-speech without the Hugging Face `pipeline` helper. Here's how:

```python
import torch
import scipy.io.wavfile
from transformers import AutoTokenizer, VitsModel  # VitsModel backs the MMS TTS checkpoints in transformers

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Benjamin-png/swahili-mms-tts-finetuned"
text = "Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili."
audio_file_path = "swahili_speech.wav"

# Load the model and tokenizer from the Hugging Face Hub
model = VitsModel.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 1: Tokenize the input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Step 2: Generate the waveform
with torch.no_grad():
    output = model(**inputs).waveform

# Step 3: Convert the PyTorch tensor to a NumPy array
output_np = output.squeeze().cpu().numpy()

# Step 4: Write the waveform to a WAV file
scipy.io.wavfile.write(audio_file_path, rate=model.config.sampling_rate, data=output_np)
```
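
If you're working in a notebook (for example, the Colab linked below), you can audition the waveform inline before saving anything; a quick sketch using IPython's display helper:

```python
from IPython.display import Audio

# Render an inline audio player for the generated waveform
Audio(output_np, rate=model.config.sampling_rate)
```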

### Saving and Playing the Audio

Beyond `scipy`, you can also save the audio with `soundfile` and play it back with `pydub`:

#### Saving the Audio

```python
import soundfile as sf

# Save the audio as a WAV file
sf.write("swahili_speech.wav", output_np, model.config.sampling_rate)
```

#### Playing the Audio

You can play the audio using `pydub`:

```python
from pydub import AudioSegment
from pydub.playback import play

# Load and play the generated audio
audio = AudioSegment.from_wav("swahili_speech.wav")
play(audio)
```
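
Note that `pydub`'s `play` helper delegates to whichever audio backend it finds (such as `simpleaudio`, `pyaudio`, or ffmpeg's `ffplay`). If playback fails, one option is to call `simpleaudio` directly; a minimal sketch, assuming it is installed with `pip install simpleaudio`:

```python
import simpleaudio as sa

# Play the WAV file and block until playback finishes
wave_obj = sa.WaveObject.from_wave_file("swahili_speech.wav")
play_obj = wave_obj.play()
play_obj.wait_done()
```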

Make sure to install the required libraries:

```bash
pip install torch transformers numpy soundfile scipy pydub
```

## Example Notebook

If you're interested in reproducing the fine-tuning process or using the model for similar purposes, you can check out the Google Colab notebook that outlines the entire process:

## License

This project is licensed under the terms of the Apache License 2.0.