How do I return timestamps on inference serverless api???

#47
by tshroveOntelio - opened

How do I return timestamps from the serverless Inference API?

*** Beware: the timestamps reset after about 30 seconds, so they are somewhat useless as is.
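One possible workaround for the reset, sketched under assumptions: this assumes the response contains a list of chunks shaped like {"text": ..., "timestamp": [start, end]}, and that each reset corresponds to a fixed window of about 30 seconds. The helper name and the exact window length are illustrative only, so the corrected times should be treated as approximate.

```python
from typing import Any, Dict, List


def make_timestamps_monotonic(
    chunks: List[Dict[str, Any]], window_seconds: float = 30.0
) -> List[Dict[str, Any]]:
    """Shift chunk timestamps so they keep increasing across resets.

    Assumption: a start time lower than the previous start time means the
    timestamps were reset, and each reset is worth `window_seconds`.
    """
    offset = 0.0
    previous_start = None
    fixed = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # A start time that goes backwards is treated as a reset.
        if previous_start is not None and start < previous_start:
            offset += window_seconds
        previous_start = start
        # The end time may be missing (None); keep it missing in that case.
        new_end = None if end is None else end + offset
        fixed.append({**chunk, "timestamp": [start + offset, new_end]})
    return fixed
```

This only detects a reset when a start time goes backwards, so a chunk that straddles a reset boundary may still come out wrong.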

"from huggingface_hub.inference._common import _b64_encode"

^^^ (this converts the audio to a base64 encoding first, which this method needs in order to work)

Using requests: "import requests"

[Image: Screenshot 2024-12-14 at 7.52.42 PM.png]
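Since the screenshot does not survive as text, here is a minimal sketch of what such a request might look like. It is not the original poster's code. It assumes the endpoint accepts a JSON body with a base64 string under "inputs" and a "parameters" object containing "return_timestamps". It uses the standard library's base64 module rather than the private `_b64_encode` helper. `api_url`, `token`, and both function names are placeholders.

```python
import base64


def build_asr_payload(audio_bytes: bytes, return_timestamps: bool = True) -> dict:
    """Build a JSON-serializable payload with base64-encoded audio.

    Assumption: the API expects {"inputs": <base64 str>, "parameters": {...}}.
    """
    return {
        "inputs": base64.b64encode(audio_bytes).decode("utf-8"),
        "parameters": {"return_timestamps": return_timestamps},
    }


def transcribe_with_timestamps(audio_path: str, api_url: str, token: str) -> dict:
    """POST the payload as JSON; api_url and token are placeholders."""
    import requests  # third-party; imported here so the helper above needs only the stdlib

    with open(audio_path, "rb") as audio_file:
        payload = build_asr_payload(audio_file.read())
    headers = {"Authorization": f"Bearer {token}"}
    # json= makes requests serialize the dict and set the JSON content type.
    response = requests.post(api_url, headers=headers, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()
```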

Unfortunately, I have not had much luck with this either. I wanted to enable verbose responses to get the detected language, but so far, no luck.

First, I tried:

#################################################
files = {
    "file": open(audio_file_path, "rb"),
    "model": "openai/whisper-large-v3-turbo",
    "response_format": "verbose_json",
    "timestamp_granularities[]": "word",
}

response = requests.post(url, headers=headers, files=files)
#################################################

Results: Transcription was a success but I got only the transcribed text.

Next, I tried:

#################################################
options = {
    "parameters": {
        "return_timestamps": True,
        "response_format": "verbose_json",
    }
}

files = {
    "file": ("audio.mp3", output_audio_io, "audio/mpeg")
}
response = requests.post(api_url, headers=headers, files=files, json=options)
#################################################

Results: Transcription was a success but I got only the transcribed text.
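A possible reason the parameters had no effect in the second attempt: when requests is given both files= and json=, it builds a multipart body from the files and the JSON is not sent. The sketch below prepares such a request without sending it, so this can be checked locally. The URL is a placeholder that is never contacted, and the function name is illustrative.

```python
import io


def prepared_request_summary() -> dict:
    """Prepare (but do not send) a request that passes both files= and json=."""
    import requests  # third-party

    request = requests.Request(
        "POST",
        "https://example.invalid/asr",  # placeholder URL, never contacted
        files={"file": ("audio.mp3", io.BytesIO(b"fake audio"), "audio/mpeg")},
        json={"parameters": {"return_timestamps": True}},
    )
    prepared = request.prepare()
    return {
        # The multipart encoding of the files decides the content type.
        "content_type": prepared.headers["Content-Type"],
        # True only if the JSON parameters made it into the body.
        "mentions_parameters": b"return_timestamps" in prepared.body,
    }
```

If "mentions_parameters" comes back False, the server never saw "return_timestamps", which would match getting plain text back.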

*** Update: using the helpful comment from @trystoh, I was able to get timestamps, though I am still trying to enable verbose responses.

*** Update 2: Oops, according to https://huggingface.co/docs/api-inference/tasks/automatic-speech-recognition, it looks like I cannot get verbose responses... If anyone can point me in the right direction, I would greatly appreciate it.

Understood. Have you tried using a base64-encoded audio file?

The docs use tricky wording like, “If not using parameters you can also just use an audio file directly”, so you may have to convert the audio to a base64 encoding first.

Let me know if I misunderstood, I spent all day on this 😂

Yeah, when I use the base64-encoded audio file, it tells me the payload is too big.
