---
license: mit
language: et
tags:
- audio
- automatic-speech-recognition
#widget:
#- example_title: Librispeech sample 1
#  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
#- example_title: Librispeech sample 2
#  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
base_model:
- openai/whisper-large-v3-turbo
library_name: transformers
---

## Introduction

This model is OpenAI Whisper large-v3-turbo, fine-tuned on ~770 hours of manually created subtitles from Estonian TV (ETV). As a result, it does not always produce verbatim (word-for-word) subtitles; it often rephrases sentences and compresses the text, especially for spontaneous speech with hesitations, repetitions, etc. However, the length of the generated text chunks almost always conforms to the ETV subtitle requirements (48 characters per line).

## Usage

This is a fine-tuned version of Whisper large-v3-turbo and can therefore be used via Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.
For this example, we'll also install 🤗 Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate
```

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "TalTechNLP/whisper-large-v3-turbo-et-subs"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

audio = "sample.mp3"  # path to a local audio file

# Force Estonian transcription rather than relying on language detection.
result = pipe(audio, generate_kwargs={"task": "transcribe", "language": "et"})
print(result)
```

## Evaluation results

TODO
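
## Example: writing the output as SRT

Because the model produces subtitle-style text, it can be convenient to write its output to a subtitle file. The following is a minimal sketch, not part of this model's tooling. It assumes the `pipe` object from the Usage section is called with `return_timestamps=True`, in which case the Transformers ASR pipeline returns a `chunks` list of `{"timestamp": (start, end), "text": ...}` entries. The helper names `format_srt_time` and `chunks_to_srt`, the `fallback_duration` parameter, and the sample chunks are hypothetical and only serve to illustrate the idea.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Build SRT text from chunks shaped like the pipeline's result["chunks"]."""
    entries = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        if start is None or not text:
            continue  # skip chunks that cannot be placed on the timeline
        if end is None:
            # The end time of the last chunk may be missing; assume a short
            # display duration (an arbitrary choice for this sketch).
            end = start + fallback_duration
        index = len(entries) + 1
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(entries)


# Hypothetical chunks standing in for:
#   result = pipe(audio, return_timestamps=True,
#                 generate_kwargs={"task": "transcribe", "language": "et"})
#   chunks = result["chunks"]
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Tere õhtust."},
    {"timestamp": (2.5, None), "text": " Algavad uudised."},
]

print(chunks_to_srt(sample_chunks))
```

With real output, `result["chunks"]` would replace `sample_chunks`, and the returned string could be written to a `.srt` file with UTF-8 encoding. Whether the chunk boundaries from the pipeline match the subtitle segmentation the model was trained on is not confirmed here and should be checked on your own audio.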