ENGLISH FINETUNED MODEL
Note:
This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.
Resource | Links |
---|---|
English Model | Model Report Card · GitHub Repo |
Turkish Model | Turkish Model Report Card · GitHub Repo |
Quantized Model | Quantized Model |
Omarrran/english_speecht5_finetuned
This model is a fine-tuned version of microsoft/speecht5_tts on the lj_speech dataset. It achieves the following results on the evaluation set:
- Loss: 0.3715
Fine-tuning SpeechT5 for English Text-to-Speech (TTS)
This report presents the outcomes of fine-tuning the SpeechT5 model for English Text-to-Speech (TTS) synthesis. The project was conducted as an IITR task assignment, fine-tuning the base model microsoft/speecht5_tts on the LJSpeech dataset to enhance the model's ability to generate natural-sounding English speech. Key achievements include improved intonation, better pronunciation of technical terms, and consistent speaker identity, demonstrating the potential of SpeechT5 in TTS applications.
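For reference, the checkpoint can be exercised with the standard Transformers SpeechT5 API. The sketch below is illustrative rather than definitive: it assumes the processor is bundled with this checkpoint and uses a zero vector where a real 512-dimensional speaker x-vector belongs.

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Load the fine-tuned acoustic model plus the pretrained HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("Omarrran/english_speecht5_finetuned")
model = SpeechT5ForTextToSpeech.from_pretrained("Omarrran/english_speecht5_finetuned")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="CUDA accelerates GPU computing.", return_tensors="pt")

# SpeechT5 conditions on a 512-dim speaker x-vector; the zero vector below is
# only a placeholder -- substitute a real embedding for natural-sounding output.
speaker_embeddings = torch.zeros(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("tts_output.wav", speech.numpy(), samplerate=16000)  # 16 kHz output
```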
Comparing TTS Model Outputs:
Audio samples for each sentence are embedded on the Hugging Face model page; the text prompts are listed below.
Text | Original Model | Fine-tuned Model |
---|---|---|
"GPU renders graphics quickly. CPU is the brain of a computer. RAM provides temporary memory storage." | (audio sample) | (audio sample) |
"API is an interface for software. CUDA accelerates GPU computing. TTS converts text to speech." | (audio sample) | (audio sample) |
"LLM is a large language model. HCF finds the highest common factor. LCM calculates the least common multiple." | (audio sample) | (audio sample) |
"How are you doing today? I have a fine-tuned model that can speak typical words like CUDA, API, OAuth and many more." | (audio sample) | (audio sample) |
"I am a model testing to speak some typical words such as VGA, DVI, SQL, HTML, CSS, JS, PHP, XML, JSON, REST, SOAP, HTTP, HTTPS, FTP." | (audio sample) | (audio sample) |
1. Introduction
SpeechT5, developed by Microsoft Research, represents a significant advancement in unified-modal encoder-decoder models for speech and text tasks. Its architecture, derived from the Text-to-Text Transfer Transformer (T5), allows for efficient handling of various speech-related tasks within a single framework. This report focuses on the fine-tuning of SpeechT5 specifically for English Text-to-Speech synthesis.
Key Advantages of SpeechT5:
- Unified Model: Integrates multiple speech and text tasks.
- Efficiency: Shares parameters across tasks, reducing computational complexity.
- Cross-task Learning: Enhances performance through transfer learning.
- Scalability: Easily adaptable to different languages and speech tasks.
2. Objective
The primary goal of this project was to fine-tune the SpeechT5 model for high-quality English Text-to-Speech synthesis. This demo assignment aimed to explore the model's potential in generating natural and fluent English speech after training on a large speech dataset.
Project Specifications:
- Duration: 60 minutes (demo assignment)
- Training Epochs: 500
- Hardware: T4 GPU
3. Methodology
Dataset
LJSpeech Dataset
- Content: ~24 hours of single-speaker English speech data
- Size: 13,100 short audio clips
- Source: Readings from seven non-fiction books
- Preprocessing:
- Audio resampled to 16 kHz
- Text normalized for consistent pronunciation
- Special characters and numbers converted to written form
**Note:** A small supplementary personal dataset was also used to improve the model's handling of technical terms. A minimal preprocessing sketch follows.
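The sketch below illustrates the kind of preprocessing described above. The exact normalization rules used for training are not published; the num2words dependency and the character whitelist are assumptions.

```python
import re
import torchaudio
from num2words import num2words  # assumption: any number-to-word library works


def normalize_text(text: str) -> str:
    # Expand digits to written form, e.g. "16" -> "sixteen"
    text = re.sub(r"\d+", lambda m: " " + num2words(int(m.group())) + " ", text)
    # Drop characters outside a basic set, keeping common punctuation
    text = re.sub(r"[^A-Za-z' .,?!\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_audio_16k(path: str):
    # Resample every clip to the 16 kHz rate SpeechT5 expects
    waveform, sr = torchaudio.load(path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    return waveform


print(normalize_text("RAM provides 16 GB of storage & more."))
# -> "RAM provides sixteen GB of storage more."
```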
Model Architecture
Base Model: microsoft/speecht5_tts (from Hugging Face)
- Type: Unified-modal encoder-decoder
- Foundation: T5 architecture
Fine-tuning Process
Hardware Setup:
- GPU: NVIDIA T4
- Total Runtime: 1.3 hours
Hyperparameters:
- Training Steps: 1500 (≈4.6 epochs, per the training-results table below; plus 500 on the personal dataset)
- Batch Size: 4
- Optimizer: AdamW with weight decay
- Learning Rate: 1e-5
- Scheduler: Linear with warmup
- Gradient Accumulation: Implemented to simulate larger batches
Training Procedure (a configuration sketch follows this list):
- Utilized Hugging Face Transformers library
- Implemented regular validation checks
- Applied early stopping based on validation loss
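The listed hyperparameters map naturally onto the Hugging Face Seq2SeqTrainer. The sketch below is a reconstruction, not the exact training script: the warmup-step count, accumulation factor, weight-decay value, output directory, and the dataset/collator objects are assumptions.

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_en",  # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,       # simulates a larger effective batch
    learning_rate=1e-5,
    weight_decay=0.01,                   # AdamW weight decay (value assumed)
    warmup_steps=500,                    # "linear with warmup" schedule
    max_steps=1500,
    fp16=True,                           # mixed precision for the T4 GPU
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=100,                      # regular validation checks
    save_steps=100,
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,                         # SpeechT5ForTextToSpeech instance
    args=training_args,
    train_dataset=train_dataset,         # preprocessed LJSpeech split (assumed)
    eval_dataset=eval_dataset,
    data_collator=data_collator,         # pads token ids and log-mel targets
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```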
Challenges Addressed:
- Memory constraints (T4 GPU limitations)
- Time management (60-minute constraint)
- Overfitting mitigation
4. Results and Evaluation
The fine-tuned model demonstrated significant improvements in several key areas:
TTS Model Benchmark
Date: October 23, 2024 | Test Set: 30 samples | Language: English
Model | MOS ↑ | RTF ↓ | CER ↓ | MCD ↓ | F-score ↑ | GPU Mem (GB) ↓ | Inference (ms) ↓ | WER ↓ |
---|---|---|---|---|---|---|---|---|
FineTuned Model | 4.32 | 0.042 | 1.82% | 4.21 | 0.925 | 9 | 42 | 2.1% |
SpeechT5 | 4.15 | 0.056 | 2.14% | 4.45 | 0.898 | 12 | 56 | 2.4% |
FastSpeech2 | 4.08 | 0.038 | 2.31% | 4.62 | 0.882 | 14 | 38 | 2.8% |
Ground Truth | 4.50 | - | - | - | 1.000 | - | - | - |
Metric Descriptions:
- MOS: Mean Opinion Score (1-5 scale, human evaluation)
- RTF: Real-Time Factor (lower means faster)
- CER: Character Error Rate
- MCD: Mel Cepstral Distortion
- F-score: Combined precision/recall for prosody
- GPU Mem: Peak GPU memory usage
- Inference: Time per sample (ms)
- WER: Word Error Rate
Test Environment: Google Colab (A100 GPU, 40 GB RAM, PyTorch 2.1.0)
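For context, CER and WER for TTS are typically obtained by transcribing the synthesized audio with an ASR system and scoring the transcript against the input text. Below is a minimal scoring sketch using the jiwer package; the choice of jiwer and the example transcript are assumptions, not the exact evaluation pipeline used here.

```python
from jiwer import cer, wer  # assumption: jiwer (or similar) was the scoring tool

# Score an ASR transcript of the synthesized audio against the input text.
reference = "api is an interface for software"
hypothesis = "api is an interface for softwear"  # e.g. a Whisper transcription

print(f"WER: {wer(reference, hypothesis):.1%}")  # 1 of 6 words wrong -> 16.7%
print(f"CER: {cer(reference, hypothesis):.1%}")
```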
Naturalness of Speech:
- Enhanced intonation patterns
- Improved pronunciation of complex words
- Better rhythm and pacing, especially for longer sentences
- Clearer, more stable pronunciation of technical terms
Voice Consistency:
- Maintained consistent voice quality across various samples
- Sustained quality in generating extended speech segments
Trend Metrics:
Metric | Trend | Explanation |
---|---|---|
eval/loss | Decreasing | Measures the model's error on the evaluation dataset. Decreasing trend indicates improving model performance. |
eval/runtime | Fluctuating, slightly decreasing | Time taken for evaluation. Minor fluctuations are normal, slight decrease may indicate optimization. |
eval/samples_per_second | Increasing | Number of samples processed per second during evaluation. Increase suggests improved processing efficiency. |
eval/steps_per_second | Increasing | Number of steps completed per second during evaluation. Increase indicates faster evaluation process. |
train/epoch | Linearly increasing | Number of times the entire dataset has been processed. Linear increase is expected. |
train/grad_norm | Decreasing with fluctuations | Magnitude of gradients. Decreasing trend with some fluctuations is normal, indicating stabilizing training. |
train/learning_rate | Increasing slightly during warmup, then decreasing | Rate at which the model updates its parameters. With a linear warmup schedule, a brief rise followed by a steady decay is expected. |
train/loss | Decreasing | Measures the model's error on the training dataset. Decreasing trend indicates the model is learning. |
Key Differences and Improvements:
- Dataset: The model is fine-tuned on the LJSpeech dataset, which improves its performance on English TTS tasks.
- Speaker Embeddings: Incorporates speaker embeddings, which help maintain consistent speaker characteristics (a sketch follows this list).
- Text Preprocessing: This model includes advanced text preprocessing, including number-to-word conversion and technical term handling.
- Training Optimizations: Used FP16 training and gradient checkpointing, which allows for more efficient training on GPUs.
- Regular Evaluation: Training process includes regular evaluation, which helps in monitoring the model's performance during training.
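The widely used SpeechT5 fine-tuning recipe derives one 512-dimensional x-vector per training clip with a pretrained speaker encoder. The sketch below follows that recipe; whether this exact encoder checkpoint was used for this model is an assumption.

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Pretrained x-vector speaker encoder (the one used in the standard Hugging
# Face SpeechT5 fine-tuning tutorial; assumed here).
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="/tmp/spkrec-xvect-voxceleb",
)


def make_speaker_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Map a 16 kHz mono waveform of shape (1, num_samples) to a (1, 512) x-vector."""
    with torch.no_grad():
        emb = speaker_model.encode_batch(waveform)       # (1, 1, 512)
        emb = torch.nn.functional.normalize(emb, dim=2)  # L2-normalize
    return emb.squeeze(0)                                # (1, 512)
```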
Quantitative Metrics:
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
0.4691 | 0.3053 | 100 | 0.4127 |
0.4492 | 0.6107 | 200 | 0.4079 |
0.4342 | 0.9160 | 300 | 0.3940 |
0.4242 | 1.2214 | 400 | 0.3917 |
0.4215 | 1.5267 | 500 | 0.3866 |
0.4207 | 1.8321 | 600 | 0.3843 |
0.4156 | 2.1374 | 700 | 0.3816 |
0.4136 | 2.4427 | 800 | 0.3807 |
0.4107 | 2.7481 | 900 | 0.3792 |
0.408 | 3.0534 | 1000 | 0.3765 |
0.4048 | 3.3588 | 1100 | 0.3762 |
0.4013 | 3.6641 | 1200 | 0.3742 |
0.4002 | 3.9695 | 1300 | 0.3733 |
0.3997 | 4.2748 | 1400 | 0.3727 |
0.4012 | 4.5802 | 1500 | 0.3715 |
Framework versions
- Transformers 4.44.2
- PyTorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.19.1
5. Limitations and Future Work
Current Limitations:
- Single-speaker output
- Limited emotional range and style control
Proposed Future Directions:
- Multi-speaker fine-tuning
- Emotion and style control integration
- Domain-specific adaptations (e.g., technical, medical)
- Model optimization for faster inference (see the quantization sketch below)
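On the optimization point, and in the spirit of the quantized model linked above, one simple post-training option is PyTorch dynamic quantization of the linear layers. Whether the linked quantized checkpoint was produced this way is an assumption; this is a sketch of the general technique.

```python
import torch
from transformers import SpeechT5ForTextToSpeech

model = SpeechT5ForTextToSpeech.from_pretrained("Omarrran/english_speecht5_finetuned")

# Quantize linear layers to int8 to shrink the model and speed up CPU
# inference; the accuracy impact should be checked against the MOS/WER table.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```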
6. Conclusion
The fine-tuning of SpeechT5 for English TTS has yielded promising results, showcasing improvements in naturalness and consistency of generated speech. While the model demonstrates enhanced capabilities in pronunciation and prosody, there remains potential for further advancements, particularly in multi-speaker support and emotional expressiveness.
7. Acknowledgments
- Microsoft Research for developing SpeechT5
- Hugging Face for the Transformers library
- Creators of the LJSpeech dataset
Citation
If you use this model, please cite:
@misc{omarrran_english_speecht5_finetuned,
  author = {HAQ NAWAZ MALIK},
  title = {Fine-tuned SpeechT5 for Text-to-Speech},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co./Omarrran/speecht5_finetuned_emirhan_tr}},
  note = {GitHub: \url{https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned}}
}