sanchit-gandhi and reach-vb committed
Commit 3d06185
1 Parent(s): 1ecca60

Update README.md (#126)

- Update README.md (1eaca33abac3ee283e44e20ac60f50fc304a7f66)

Co-authored-by: Vaibhav Srivastav <[email protected]>

Files changed (1)
  1. README.md +237 -20
README.md CHANGED
@@ -163,7 +163,7 @@ checkpoints are summarised in the following table with links to the models on th

  ## Usage

- Whisper `large-v3` is supported in Hugging Face 🤗 Transformers through the `main` branch in the Transformers repo. To run the model, first
  install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load toy
  audio dataset from the Hugging Face Hub:

@@ -172,11 +172,10 @@ pip install --upgrade pip
  pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
  ```

  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
- class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
- long-form audio files, which in-practice is 9x faster than the sequential algorithm proposed by OpenAI
- (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
- be set based on the specifications of your device:

  ```python
  import torch
@@ -258,42 +257,260 @@ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "fren
  print(result["chunks"])
  ```

- ## Additional Speed & Memory Improvements

- You can apply additional speed and memory improvements to Whisper-large-v3 which we cover in the following.

- ### Flash Attention

- We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
- To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

  ```
  pip install flash-attn --no-build-isolation
  ```

- and then all you have to do is to pass `use_flash_attention_2=True` to `from_pretrained`:

  ```diff
  - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
  ```

- ### Torch Scale-Product-Attention (SDPA)

- If your GPU does not support Flash Attention, we recommend making use of [BetterTransformers](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
- To do so, you first need to install optimum:

  ```
- pip install --upgrade optimum
- ```

- And then convert your model to a "BetterTransformer" model before using it:

  ```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- + model = model.to_bettertransformer()
  ```

  ## Fine-Tuning

  The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
 
  ## Usage

+ Whisper `large-v3` is supported in Hugging Face 🤗 Transformers. To run the model, first
  install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load toy
  audio dataset from the Hugging Face Hub:

  pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
  ```

+ ### Short-Form Transcription
+
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+ class to transcribe short-form audio files (< 30-seconds) as follows:

  ```python
  import torch

  print(result["chunks"])
  ```

+ <details>
+
+ <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
+
+ Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
+ for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
+ for more details.
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+ from datasets import Audio, load_dataset
+

+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
+ sample = dataset[0]["audio"]

+ input_features = processor(
+     sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
+ ).input_features
+
+ input_features = input_features.to(device, dtype=torch_dtype)
+
+ gen_kwargs = {
+     "max_new_tokens": 128,
+     "num_beams": 1,
+     "return_timestamps": False,
+ }
+
+ pred_ids = model.generate(input_features, **gen_kwargs)
+ pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
+
+ print(pred_text)
+ ```
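As a minimal sketch of the prompting route mentioned above (the prompt text and the reuse of `processor`, `model`, `input_features` and `device` from the example are illustrative assumptions), `prompt_ids` produced by `processor.get_prompt_ids` can be passed to `model.generate`:

```python
# Illustrative prompting sketch: assumes processor, model, input_features and device
# from the example above; the prompt text itself is arbitrary.
prompt_ids = processor.get_prompt_ids("Mr. Quilter", return_tensors="pt").to(device)

pred_ids = model.generate(input_features, prompt_ids=prompt_ids, max_new_tokens=128)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```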
+
+ </details>
+
+ ### Sequential Long-Form
+
+ This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
+ and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
+
+ The sequential long-form algorithm should be used in either of the following scenarios:
+ 1. Transcription accuracy is the most important factor, and latency is less of a consideration
+ 2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
+
+ The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+ class can be used to transcribe long audio files with the sequential algorithm as follows:
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
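The `pipeline` call is not limited to dataset samples; a path to a local audio file can be passed instead (the filename below is purely illustrative):

```python
# Transcribe a local audio file with the same pipeline (filename is illustrative)
result = pipe("audio.mp3")
print(result["text"])
```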
+
+ <details>
+
+ <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+ from datasets import Audio, load_dataset
+
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
+ sample = dataset[0]["audio"]
+
+ inputs = processor(
+     sample["array"],
+     sampling_rate=sample["sampling_rate"],
+     return_tensors="pt",
+     truncation=False,
+     padding="longest",
+     return_attention_mask=True,
+ )
+ inputs = inputs.to(device, dtype=torch_dtype)
+
+ gen_kwargs = {
+     "max_new_tokens": 448,
+     "num_beams": 1,
+     "condition_on_prev_tokens": False,
+     "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
+     "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
+     "logprob_threshold": -1.0,
+     "no_speech_threshold": 0.6,
+     "return_timestamps": True,
+ }
+
+ pred_ids = model.generate(**inputs, **gen_kwargs)
+ pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
+
+ print(pred_text)
+ ```
+
+ </details>
+
+ ### Chunked Long-Form
+
+ large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
+ a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
+ the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
+ [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
+
+ To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 25-seconds
+ is optimal. To activate batching over long audio files, pass the argument `batch_size`:
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     chunk_length_s=25,
+     batch_size=16,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
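Mirroring the `return_timestamps` usage shown earlier in the README, the same chunked pipeline can also return segment-level timestamps:

```python
# Request segment-level timestamps from the chunked pipeline defined above
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```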
+
+ ### Additional Speed & Memory Improvements
+
+ You can apply additional speed and memory improvements to Whisper large-v3 to further reduce the inference time and VRAM
+ requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
+ more efficient flash attention version.
+
+ #### Flash Attention 2
+
+ We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
+ if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

  ```
  pip install flash-attn --no-build-isolation
  ```

+ Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

  ```diff
  - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
  ```
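As an optional sanity check (a small sketch, assuming the `is_flash_attn_2_available` utility is present in your installed Transformers version), you can confirm that Transformers detects the Flash Attention 2 installation before loading the model:

```python
# Check that Transformers can see the flash-attn installation
from transformers.utils import is_flash_attn_2_available

print(is_flash_attn_2_available())
```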

+ #### Torch Scaled Dot-Product Attention (SDPA)

+ If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
+ This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
+ whether you have a compatible PyTorch version, run the following Python code snippet:

+ ```python
+ from transformers.utils import is_torch_sdpa_available
+
+ print(is_torch_sdpa_available())
  ```

+ If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
+ returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).
+
+ Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
+ `attn_implementation="sdpa"` as follows:

  ```diff
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
  ```

+ For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).
+
+ #### Torch compile
+
+ Coming soon...
+
+ #### 4-bit and 8-bit Inference
+
+ Coming soon...
+
  ## Fine-Tuning

  The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,