tensorops committed on
Commit b88b11d
1 parent: fabb5e1

Update README.md

Files changed (1)
  1. README.md +124 -126
README.md CHANGED
@@ -1,126 +1,124 @@
- ---
- license: mit
- ---
- ---
- language:
- - th
- license: mit
- library_name: transformers
- tags:
- - whisper-event
- - generated_from_trainer
- datasets:
- - CMKL/Porjai-Thai-voice-dataset-central
- metrics:
- - wer
- base_model: biodatlab/whisper-th-medium-combined
- model-index:
- - name: Whisper Medium Thai Timestamp - biodatlab
- results:
- - task:
- type: automatic-speech-recognition
- name: Automatic Speech Recognition
- dataset:
- name: mozilla-foundation/common_voice_13_0 th
- type: mozilla-foundation/common_voice_13_0
- config: th
- split: test
- args: th
- metrics:
- - type: wer
- value: 15.57
- name: Wer
- ---
-
- # Whisper Medium (Thai) Timestamp
-
- This model is a fine-tuned version of [biodatlab/whisper-th-medium-combined](biodatlab/whisper-th-medium-combined) on a custom-created longform dataset derived from the CMKL/Porjai-Thai-voice-dataset-central. It achieves the following results on the common-voice-13 test set:
- - WER: 15.57 (with Deepcut Tokenizer)
-
- ## Model description
-
- This model is designed to perform automatic speech recognition (ASR) for the Thai language, with the added capability of generating timestamps for the transcribed text. It's based on the Whisper medium architecture and has been fine-tuned on a specially crafted dataset to enable timestamp generation.
-
- Use the model with Hugging Face's `transformers` as follows:
-
- ```py
- from transformers import pipeline
- import torch
-
- MODEL_NAME = "biodatlab/whisper-th-medium-timestamp" # specify the model name
- lang = "th" # Thai language
-
- device = 0 if torch.cuda.is_available() else "cpu"
-
- pipe = pipeline(
- task="automatic-speech-recognition",
- model=MODEL_NAME,
- chunk_length_s=30,
- device=device,
- return_timestamps=True,
- )
- pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
- language=lang,
- task="transcribe"
- )
- result = pipe("audio.mp3", return_timestamps=True)
- text = result["text"]
- timestamps = result["chunks"]
- ```
-
- ## Intended uses & limitations
- This model is intended for Thai automatic speech recognition tasks, particularly where timestamp information is required. It can be used for transcribing Thai audio content, creating subtitles, or any application that needs to align text with specific time points in audio.
- The model's performance on speech recognition may be lower compared to non-timestamped versions due to the additional complexity of the task and the pseudo-timestamp generation method used in training.
- ## Training and evaluation data
- The model was trained on a custom-created longform dataset derived from the CMKL/Porjai-Thai-voice-dataset-central. The dataset creation process involved the following steps:
-
- - Combining multiple short audio clips from the original dataset into longer audio segments (up to 30 seconds).
- - Adding environmental noises and silences between clips to simulate more realistic speech scenarios.
- - Generating pseudo-timestamps for the combined audio using a Voice Activity Detection (VAD) model (Silero VAD).
-
- This approach allowed us to create a dataset with longer, more diverse audio samples and approximate timestamp information, which is crucial for training a model capable of generating timestamps.
- ## Training procedure
- The model was fine-tuned using a custom training script that incorporates the following:
-
- - Mixed precision training (FP16)
- - Gradient accumulation
- - SpecAugment for data augmentation during training
-
- ## Training hyperparameters
- The following hyperparameters were used during training:
-
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- gradient_accumulation_steps: 1
- num_train_iters: 50000
- warmup_steps: 50
- fp16: True
- optimizer: AdamW
- lr_scheduler_type: linear
-
- ## Framework versions
-
- Transformers 4.44.2
- Pytorch 2.4.1
- Datasets 3.0.0
- Tokenizers 0.20.0
-
- ## Performance and Limitations
- The WER (Word Error Rate) of 15.57 on the Common Voice 13 test set indicates good performance for Thai ASR. However, it's important to note that the timestamp generation model has a lower accuracy compared to the non-timestamped version of the model. This is due to several factors:
-
- - The use of pseudo-timestamps in training data, which are approximations based on VAD rather than precise human annotations.
- - The additional complexity of the timestamp prediction task, which requires the model to learn both transcription and temporal alignment.
- - Potential discrepancies between the VAD-generated timestamps and actual word boundaries in continuous speech.
-
- Users should be aware that while the timestamps provide a general indication of when words or phrases occur in the audio, they may not be as precise as manually annotated timestamps. The model's performance may also vary depending on the acoustic conditions, speaker variability, and the presence of background noise in the input audio.
- ## Citation
- If you use this model in your research or applications, please cite it as follows:
-
- @misc{biodatlab_whisper_th_medium_timestamp,
- author = {Atirut Boribalburephan, Zaw Htet Aung, Knot Pipatsrisawat, Titipat Achakulvisut},
- title = {Whisper Medium Thai Timestamp: A fine-tuned Whisper model for Thai automatic speech recognition with timestamp generation},
- year = 2024,
- publisher = {Hugging Face},
- howpublished = {\url{https://huggingface.co/biodatlab/whisper-th-medium-timestamp}}
- }
+ ---
+ language:
+ - th
+ license: mit
+ library_name: transformers
+ tags:
+ - whisper-event
+ - generated_from_trainer
+ datasets:
+ - CMKL/Porjai-Thai-voice-dataset-central
+ metrics:
+ - wer
+ base_model: biodatlab/whisper-th-medium-combined
+ model-index:
+ - name: Whisper Medium Thai Timestamp - biodatlab
+ results:
+ - task:
+ type: automatic-speech-recognition
+ name: Automatic Speech Recognition
+ dataset:
+ name: mozilla-foundation/common_voice_13_0 th
+ type: mozilla-foundation/common_voice_13_0
+ config: th
+ split: test
+ args: th
+ metrics:
+ - type: wer
+ value: 15.57
+ name: Wer
+ ---
+
+ # Whisper Medium (Thai) Timestamp
+
+ This model is a fine-tuned version of [biodatlab/whisper-th-medium-combined](biodatlab/whisper-th-medium-combined) on a custom-created longform dataset derived from the CMKL/Porjai-Thai-voice-dataset-central. It achieves the following results on the common-voice-13 test set:
+ - WER: 15.57 (with Deepcut Tokenizer)
+
+ ## Model description
+
+ This model is designed to perform automatic speech recognition (ASR) for the Thai language, with the added capability of generating timestamps for the transcribed text. It's based on the Whisper medium architecture and has been fine-tuned on a specially crafted dataset to enable timestamp generation.
+
+ Use the model with Hugging Face's `transformers` as follows:
+
+ ```py
+ from transformers import pipeline
+ import torch
+
+ MODEL_NAME = "biodatlab/whisper-th-medium-timestamp" # specify the model name
+ lang = "th" # Thai language
+
+ device = 0 if torch.cuda.is_available() else "cpu"
+
+ pipe = pipeline(
+ task="automatic-speech-recognition",
+ model=MODEL_NAME,
+ chunk_length_s=30,
+ device=device,
+ return_timestamps=True,
+ )
+ pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
+ language=lang,
+ task="transcribe"
+ )
+ result = pipe("audio.mp3", return_timestamps=True)
+ text = result["text"]
+ timestamps = result["chunks"]
+ ```
+
+ ## Intended uses & limitations
+ This model is intended for Thai automatic speech recognition tasks, particularly where timestamp information is required. It can be used for transcribing Thai audio content, creating subtitles, or any application that needs to align text with specific time points in audio.
+ The model's performance on speech recognition may be lower compared to non-timestamped versions due to the additional complexity of the task and the pseudo-timestamp generation method used in training.
+ ## Training and evaluation data
+ The model was trained on a custom-created longform dataset derived from the CMKL/Porjai-Thai-voice-dataset-central. The dataset creation process involved the following steps:
+
+ - Combining multiple short audio clips from the original dataset into longer audio segments (up to 30 seconds).
+ - Adding environmental noises and silences between clips to simulate more realistic speech scenarios.
+ - Generating pseudo-timestamps for the combined audio using a Voice Activity Detection (VAD) model (Silero VAD).
+
+ This approach allowed us to create a dataset with longer, more diverse audio samples and approximate timestamp information, which is crucial for training a model capable of generating timestamps.
+ ## Training procedure
+ The model was fine-tuned using a custom training script that incorporates the following:
+
+ - Mixed precision training (FP16)
+ - Gradient accumulation
+ - SpecAugment for data augmentation during training
+
+ ## Training hyperparameters
+ The following hyperparameters were used during training:
+
+ - learning_rate: 1e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - gradient_accumulation_steps: 1
+ - num_train_iters: ~50000
+ - warmup_steps: 500
+ - fp16: True
+ - optimizer: AdamW
+ - lr_scheduler_type: linear
+
+ ## Framework versions
+
+ - Transformers 4.44.2
+ - Pytorch 2.4.1
+ - Datasets 3.0.0
+ - Tokenizers 0.20.0
+
+ ## Performance and Limitations
+ The WER (Word Error Rate) of 15.57 on the Common Voice 13 test set indicates good performance for Thai ASR. However, it's important to note that the timestamp generation model has a lower accuracy compared to the non-timestamped version of the model. This is due to several factors:
+
+ - The use of pseudo-timestamps in training data, which are approximations based on VAD rather than precise human annotations.
+ - The additional complexity of the timestamp prediction task, which requires the model to learn both transcription and temporal alignment.
+ - Potential discrepancies between the VAD-generated timestamps and actual word boundaries in continuous speech.
+
+ Users should be aware that while the timestamps provide a general indication of when words or phrases occur in the audio, they may not be as precise as manually annotated timestamps. The model's performance may also vary depending on the acoustic conditions, speaker variability, and the presence of background noise in the input audio.
+ ## Citation
+ If you use this model in your research or applications, please cite it as follows:
+ ```
+ @misc{biodatlab_whisper_th_medium_timestamp,
+ author = {Atirut Boribalburephan, Zaw Htet Aung, Knot Pipatsrisawat, Titipat Achakulvisut},
+ title = {Whisper Medium Thai Timestamp: A fine-tuned Whisper model for Thai automatic speech recognition with timestamp generation},
+ year = 2024,
+ publisher = {Hugging Face},
+ howpublished = {\url{https://huggingface.co/biodatlab/whisper-th-medium-timestamp}}
+ }
+ ```
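
The README in this diff lists subtitle creation as an intended use and reads the timestamps from `result["chunks"]`, a list of dicts that each hold a `timestamp` (start, end) tuple in seconds and a `text` string. As a minimal sketch of that use case, the following turns such a list into SRT text. The names `format_srt_time`, `chunks_to_srt`, and `fallback_duration` are hypothetical and are not part of the commit or of `transformers`; the fallback used when a chunk has no end time is likewise an assumption.

```python
def format_srt_time(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration: float = 2.0) -> str:
    """Build SRT text from pipeline-style chunks.

    Each chunk is expected to look like
    {"timestamp": (start, end), "text": "..."}.
    Assumption: when the end time is None, the cue is given
    `fallback_duration` seconds.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        cue = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        entries.append(f"{index}\n{cue}\n{chunk['text'].strip()}")
    # SRT cues are separated by one blank line.
    return "\n\n".join(entries) + "\n"


if __name__ == "__main__":
    # Hand-written chunks standing in for result["chunks"].
    demo_chunks = [
        {"timestamp": (0.0, 1.5), "text": " สวัสดี"},
        {"timestamp": (1.5, None), "text": " ครับ"},
    ]
    print(chunks_to_srt(demo_chunks))
```

Because the model was trained on VAD-derived pseudo-timestamps, cue boundaries produced this way are approximate and may need manual adjustment for tight subtitle timing.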