Conversions to other formats

#1
by PierreMesure - opened

Hi,

Thank you so much for your work! 🤗

In order to use the model in different settings, I have already started converting it to optimised formats:

  • GGML to use it with MacWhisper (by far the best way to use a Whisper model, unfortunately only on Macs)
  • ONNX to use it with transformers.js in browsers (like this example with WebGPU)
  • CT2 to use it with Speaches and deploy it in an organisation through an API

I think these different formats could be useful for other people, so I've started to publish them on Hugging Face. Do you have any guidelines regarding this? Or would you prefer to publish these variants yourself?

I think it would also be great to have quantized or distilled versions (using faster-whisper or distil-whisper) at some point, but I'm not sure how relevant they are and I don't have the expertise to do it myself. Is it something you are considering, or do you welcome community initiatives?

Another huge MacWhisper fan here, but I thought about it a bit differently, so I wrote to Jordi:

I love MacWhisper. I have even convinced the IT department of [redacted] that I should be allowed to run it on my work Mac. That's no small feat ;)

... and on that same Mac, now while traveling, I want to try to convert KB-Whisper Large on Hugging Face to GGML so I can install it in MacWhisper - but I'm not allowed to pip install anything on this machine, so I can't run the conversion :(

But I think it would be fairly easy for a genius like you to add a field on the model page where I can add https://huggingface.co./KBLab/kb-whisper-large and you'll fix the rest for me (and once the conversion is done, every other (Swedish) user of MacWhisper can also use it...)

  • GGML to use it with MacWhisper (by far the best way to use a Whisper model, unfortunately only on Macs)

Surely this is the whisper.cpp ggml format? That's all platforms, not just macOS :)

https://github.com/ggerganov/whisper.cpp/

(And yes, that's how I use whisper models as well)

@troed What in our comments made you assume we thought GGML was exclusively for macOS?
MacWhisper, however, is an app that's only available on macOS, and it was used as an example.

@PierreMesure, I can't see the GGML versions posted in your repository? Pretty please, I would love to have them.

@troed What in our comments made you assume we thought GGML was exclusively for macOS?

hugs

I have converted the model to ggml - anyone who wants it can reach out to me :)
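
For anyone who wants to reproduce it, here is a rough sketch of how the conversion is typically done with whisper.cpp's conversion script (the script path and argument order are assumptions based on the whisper.cpp repo, and it also needs a local clone of openai/whisper for the mel filters):

from huggingface_hub import snapshot_download
import subprocess

# Download the Hugging Face checkpoint locally first
model_dir = snapshot_download("KBLab/kb-whisper-large")

# whisper.cpp's converter reads the HF checkpoint and writes a ggml-model.bin
subprocess.run([
    "python", "whisper.cpp/models/convert-h5-to-ggml.py",
    model_dir,        # HF model directory
    "whisper",        # path to a clone of github.com/openai/whisper (for the mel filters)
    "output-ggml",    # output directory
])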

@PierreMesure

I think it would also be great to have quantized or distilled versions (using faster-whisper or distil-whisper) at some point but I'm not sure how relevant they are [...]

faster-whisper is used in rhasspy/wyoming-faster-whisper, which powers the Home Assistant speech pipeline.

The conversion seems to be rather easy to do. I got it working by running the command below in a Docker container with ctranslate2 installed.

ct2-transformers-converter --model KBLab/kb-whisper-tiny --output_dir /var/data/kb-whisper-tiny-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
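
As a quick sanity check, the converted directory can then be loaded directly with faster-whisper (a minimal sketch; the device and compute type below are just examples):

from faster_whisper import WhisperModel

# Point faster-whisper at the CTranslate2 directory produced by the converter above
model = WhisperModel("/var/data/kb-whisper-tiny-ct2", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav")
print(info.language, [segment.text for segment in segments])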

I'm also interested in hearing how KB wants to handle "offspring" of these models. Is it up to the community to upload the variants?

National Library of Sweden / KBLab org

We prefer to host these formats ourselves so we can better track usage statistics for the models.

I have just updated this repo to include faster-whisper, onnx and whisper.cpp compatible versions of the model. Will look at updating the README tomorrow with usage examples for each format.

faster-whisper should now work out of the box by just specifying this repo:

from faster_whisper import WhisperModel

model = WhisperModel(
    "KBLab/kb-whisper-large",
    device="cuda",
    compute_type="float16",
    download_root="cache-faster-whisper", # cache_dir
)

# Transcribe audio_mono.wav
segments, info = model.transcribe("audio_mono.wav", condition_on_previous_text=False)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

ONNX usage:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "KBLab/kb-whisper-large", cache_dir="cache", subfolder="onnx"
)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large", cache_dir="cache", subfolder="onnx"
)

import soundfile as sf

# sf.read returns (data, sample_rate); the feature extractor expects 16 kHz audio
audio = sf.read("audio.wav")

inputs = processor.feature_extractor(audio[0], sampling_rate=16000, return_tensors="pt")
model._supports_cache_class = False  # workaround so generate() works with the ORT model
gen_tokens = model.generate(**inputs)
print(processor.decode(gen_tokens[0]))

Whisper.cpp requires downloading files with wget, which I guess won't count towards usage statistics. Will update tomorrow with an example.
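
In the meantime, a rough sketch of what usage could look like; the GGML filename and the whisper.cpp binary name are assumptions (check the repo's file listing and your build), and fetching the file via huggingface_hub instead of wget may also show up in the download counts:

from huggingface_hub import hf_hub_download
import subprocess

# Fetch a GGML checkpoint through the hub library (filename is a guess; see the repo's files tab)
model_path = hf_hub_download("KBLab/kb-whisper-large", "ggml-model-q5_0.bin")

# Run whisper.cpp on a 16 kHz mono WAV; the CLI binary may be "main" or "whisper-cli" depending on the build
subprocess.run(["./main", "-m", model_path, "-f", "audio.wav", "-l", "sv"])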

Wow, @Lauler this is great! I'll try to use your ONNX variants instead of my own in whisper.mesu.re. I agree with you; I think it's better to keep all formats under your organisation for consistency and statistics.

I'm not sure all software will handle the many files gracefully though, if they are all in the same folder. But I guess that's easy to test.

Regarding ONNX, you might want to use the conversion script in the transformers.js library and include quantized variants. In addition, you should use the code from this PR or the model won't work with transformers.js.

python -m scripts.convert --quantize --model_id KBLab/kb-whisper-tiny

National Library of Sweden / KBLab org

I was inspired by NbAiLab/nb-whisper-large, which adds multiple formats to the same repo. transformers and faster-whisper seem clever enough to only download the relevant files depending on the backend you are using.
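
For what it's worth, plain transformers usage from the same repo seems unaffected by the extra format files (the dtype, device and chunking settings below are just illustrative):

import torch
from transformers import pipeline

# Standard transformers pipeline; only the files needed for this backend are downloaded
pipe = pipeline(
    "automatic-speech-recognition",
    model="KBLab/kb-whisper-large",
    torch_dtype=torch.float16,
    device="cuda",  # or "cpu"
)
print(pipe("audio.wav", chunk_length_s=30)["text"])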

I converted to ONNX using optimum with the following settings:

import os
import subprocess

os.makedirs("onnx", exist_ok=True)
subprocess.run(
    [
        "optimum-cli",
        "export",
        "onnx",
        "--model",
        model_path,  # local path to the fine-tuned checkpoint (or a hub ID)
        "--task",
        "automatic-speech-recognition-with-past",
        "onnx",  # output directory
    ]
)

It seemed beneficial to be able to run inference with a KV cache (automatic-speech-recognition-with-past). However, I guess this does not work with transformers.js.

It seems transformers.js should work as long as there's a subfolder called onnx in the repo where the .onnx files are located.

Transformers.js supports loading any model hosted on the Hugging Face Hub, provided it has ONNX weights (located in a subfolder called onnx). For more information on how to convert your PyTorch, TensorFlow, or JAX model to ONNX, see the conversion section. source.

I need to test a bit whether the transformers.js conversion script also works with regular transformers before pushing changes.

I struggled for several hours with optimum-cli and with transformers.js's conversion script. Make sure you use the code in the PR to get working models.
I would like to release whisper.mesu.re in the coming days, so please convert tiny, base and small (or I could do it for you if you want?) so I can point to your repos. 😊

Thanks @Lauler for adding the ggml model! I'd also be happy to have access to the other model sizes as ggml, e.g. for mobile applications or slower computers.

@PierreMesure this website is extremely cool! Do you have the same already for other languages? :O

@mbroedl It’s just a fork of the whisper-web project by @Xenova , you can find the source code at the bottom of the page.

Xenova’s project was downloading OpenAI’s models (ONNX versions); the main thing I did was point it to KB’s models and translate it. I made a couple of other improvements which I’ll try to submit as PRs if @Xenova is interested (the project seems a bit stale). I have now enabled GPU support, but it’s disabled by default as it seems to be hit and miss with the quants.

By the way, the app is now fetching the models from KB’s repos so I can confirm the ONNX versions work just as well as mine. I’ll delete mine soon.

National Library of Sweden / KBLab org

I have added usage examples for faster-whisper, WhisperX, whisper.cpp and ONNX in the README.

Every model has two GGML checkpoints: one without quantization and one with q5_0. I'm not too knowledgeable about which quantized versions are popular and perform well, so if you have specific requests, let me know.
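
If anyone needs a different level, whisper.cpp also ships a quantize tool that can re-quantize the unquantized GGML checkpoint locally; a rough sketch (binary and file names are assumptions, depending on your build and which file you downloaded):

import subprocess

# Re-quantize the full-precision GGML checkpoint to q8_0 (any type whisper.cpp supports works the same way)
subprocess.run([
    "./quantize",             # built as part of whisper.cpp
    "ggml-model.bin",         # unquantized checkpoint from the repo
    "ggml-model-q8_0.bin",    # output file
    "q8_0",                   # target quantization type
])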

Try the different libraries/formats that are supported: https://huggingface.co./KBLab/kb-whisper-large#usage

Lauler changed discussion status to closed

This is great!
It would also be useful to add library_name: ctranslate2; this is how Speaches lists what models can be used (so this is the way to make the models available in Speaches). Unfortunately, it doesn't seem to be possible to list several entries in library_name, so I'm not sure what the solution is. I opened an issue on their repo.

National Library of Sweden / KBLab org

I have added ctranslate2 as a tag (see https://huggingface.co./docs/hub/en/model-cards#specifying-a-library). Maybe it can help them if they perform a secondary check for whether ctranslate2 exists among a repo's tags.
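
For reference, the tag lives in the model card's YAML front matter, and it can also be set programmatically; a small sketch with huggingface_hub (assumes write access to the repo):

from huggingface_hub import metadata_update

# Adds "ctranslate2" to the tags field of the model card's YAML metadata
# (may need overwrite=True depending on the existing metadata)
metadata_update("KBLab/kb-whisper-large", {"tags": ["ctranslate2"]})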
