About filler word detection (a.k.a. filled pauses / disfluencies, e.g. eh, umm, ahh)

#30 opened by rmajasol

Hi, I've tested the model by pronouncing some of these filler sounds (eh, umm, ahh, among others), but none of them appeared in the transcription. Is there a parameter to adjust this, or is it necessary to fine-tune the model on new datasets containing these sounds?

Thank you!

I agree. Some filler words, stammered words, and repetitions of the same word weren't detected in the text either.
I think this is because the model itself smooths the transcript into cleaner, more fluent text.

Whisper paper, "Robust Speech Recognition via Large-Scale Weak Supervision", page 21, Appendix C, Text Standardization:

"...We perform the following steps to normalize English texts in different styles into a standardized form, which is a best-effort attempt to penalize only when a word error is caused by actually mistranscribing a word, and not by formatting or punctuation differences.

[Two entries omitted]

  1. Remove any of the following words: hmm, mm, mhm, mmm, uh, um"

In other words, those are effectively filtered out.
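To see the normalizer in action outside of evaluation, here's a minimal sketch using the EnglishTextNormalizer that ships with the openai-whisper package (the input sentence is just a made-up example):

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
# fillers like "hmm", "uh", "um" are removed, text is lowercased,
# punctuation is stripped, and contractions are expanded
print(normalizer("Hmm, uh, I think, um, we're ready."))
# prints something like: i think we are ready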

Agreed, but is there any way to add them back in? A flag for this would be useful when using the disfluency timings to edit the audio or the associated video.

Yes, that would be very useful for giving feedback on different aspects of spoken audio that Whisper could detect, especially for educational purposes such as training speaking skills and reporting which filler words were pronounced.

If you're using the model + processor, you can set normalize=False in the processor to skip the entire text normalisation step:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text with normalisation
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)
print(transcription)
# decode token ids to text without normalisation
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)
print(transcription)

Print Output:

['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel']
[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']
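Note that normalize defaults to False in batch_decode, so you only get the normalized form when you ask for it. Keep in mind, though, that even without normalisation the model will often leave fillers out on its own, as noted above.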

Here's a little trick: I prompted Whisper with "So uhm, yeaah. Okay, ehm, uuuh."

That caused it to transcribe these filler words, at least occasionally. Just for reference, I am using this Whisper implementation: https://github.com/guillaumekln/faster-whisper with the "tiny.en" model.
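For anyone who wants to reproduce the trick with faster-whisper, here's a minimal sketch (the audio path is a placeholder):

from faster_whisper import WhisperModel

model = WhisperModel("tiny.en")
# seeding the decoder with a filler-heavy prompt nudges it towards
# keeping disfluencies in the transcript
segments, info = model.transcribe(
    "audio.wav",
    initial_prompt="So uhm, yeaah. Okay, ehm, uuuh.",
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")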

Check out this model, which was specifically designed with filler detection in mind:
https://huggingface.co./nyrahealth/CrisperWhisper

and the accompanying repo and paper:
https://github.com/nyrahealth/CrisperWhisper
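Here's a minimal sketch of loading it through the standard transformers pipeline; this is an assumption on my part, and the CrisperWhisper README recommends its own setup (including a patched transformers fork for accurate word-level timestamps), so check the repo first:

import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# assumes the checkpoint loads with the stock Whisper pipeline classes
asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",
    device=device,
)
print(asr("audio.wav")["text"])  # placeholder path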

You can now detect them using the initial_prompt parameter, e.g. initial_prompt="Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.".
You can see it in the prompting examples here: https://platform.openai.com/docs/guides/speech-to-text/prompting
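With the open-source whisper package, that looks something like the sketch below (note the OpenAI API itself calls this parameter prompt rather than initial_prompt):

import whisper

model = whisper.load_model("tiny.en")
result = model.transcribe(
    "audio.wav",  # placeholder path
    initial_prompt="Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.",
)
print(result["text"])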
