On-Demand Audio Transcription using Public Infrastructure
OpenAI's Whisper model transcribes audio with remarkable accuracy. Running such models in production usually requires dedicated infrastructure, which carries an ongoing cost. Hugging Face supports short-form transcription of clips up to 30 seconds long on publicly available infrastructure, but most audio files are longer than that.
To use this public infrastructure for full transcription of longer-form audio without spinning up a dedicated inference endpoint, I've built an on-demand transcription app that splits audio files into manageable chunks, processes them with Whisper, and generates both a full transcription and a concise summary of the underlying audio.
Here’s how the app works and how you can use it to handle audio files up to 5 minutes long (please note that this limit is arbitrary, and just one I've chosen for the sake of this demo).
Challenges of Longer-Form Audio
Without a dedicated inference endpoint, the publicly hosted Whisper base model only processes audio segments of up to 30 seconds. Longer-form transcription can be computationally expensive and requires dedicated infrastructure, which a production version of an application like this would certainly need.
To support larger audio files, I've implemented a chunking mechanism that splits the audio into 30-second snippets, processes each snippet individually, and then merges the results. The fundamental trade-off of this architecture is longer processing time in exchange for cheaper operation.
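The trade-off is easy to see with a little arithmetic. The sketch below is illustrative only and is not the app's actual code: the function name plan_chunks and both constants are hypothetical, with the values taken from the 30-second and 5-minute figures mentioned in this post.

```python
import math

CHUNK_SECONDS = 30.0  # per-request limit described above
MAX_SECONDS = 300.0   # this demo's arbitrary 5-minute cap


def plan_chunks(duration, chunk=CHUNK_SECONDS, limit=MAX_SECONDS):
    """Return (start, end) times in seconds for each chunk of a recording."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    if duration > limit:
        raise ValueError(f"audio is {duration:.0f}s long; the limit is {limit:.0f}s")
    count = math.ceil(duration / chunk)  # the last chunk may be shorter
    return [(i * chunk, min((i + 1) * chunk, duration)) for i in range(count)]


if __name__ == "__main__":
    plan = plan_chunks(300.0)
    print(len(plan), "sequential requests for a 5-minute file")
```

A full-length 5-minute file becomes ten 30-second requests processed one after another, which is where the extra processing time comes from.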
Chunking Process
To manage audio files efficiently, I used the open-source audio libraries Librosa and Soundfile:
- Splitting the Audio: The audio file is loaded using librosa.load(), which extracts both the audio data and its sampling rate.
- Dividing into Chunks: The audio is divided into 30-second segments. Each chunk corresponds to a specific range of audio samples, calculated based on the sampling rate.
- Saving Temporary Chunks: Each 30-second chunk is saved temporarily as a WAV file using soundfile.write().
This chunking approach ensures the app can handle larger files without overloading the publicly available endpoint, while maintaining the accuracy of transcription.
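As a rough illustration of the three steps above, here is a minimal sketch rather than the app's actual implementation. The function names split_samples and save_chunks, the temporary directory prefix, and the choice of sr=None are all my assumptions; only librosa.load() and soundfile.write() come from the description above.

```python
import os
import tempfile


def split_samples(samples, sample_rate, chunk_seconds=30):
    """Slice a 1-D sequence of samples into consecutive chunks."""
    chunk_len = int(chunk_seconds * sample_rate)  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def save_chunks(audio_path, chunk_seconds=30):
    """Load an audio file and write each chunk to a temporary WAV file."""
    # Imported here so split_samples stays usable without the audio libraries.
    import librosa
    import soundfile as sf

    # sr=None keeps the file's native sampling rate (an assumption; by
    # default librosa resamples to 22,050 Hz). The audio is loaded as mono.
    samples, sample_rate = librosa.load(audio_path, sr=None)
    out_dir = tempfile.mkdtemp(prefix="chunks_")
    paths = []
    chunks = split_samples(samples, sample_rate, chunk_seconds)
    for index, chunk in enumerate(chunks):
        path = os.path.join(out_dir, f"chunk_{index:03d}.wav")
        sf.write(path, chunk, sample_rate)  # one WAV file per chunk
        paths.append(path)
    return paths
```

The final chunk is simply whatever is left over, so a 70-second file yields chunks of 30, 30, and 10 seconds. The temporary files should be deleted once transcription is finished.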
All Open Source Tools
This app harnesses the power of open-source tools, including:
- Hugging Face Transformers: For the Whisper model and text summarization pipeline.
- Gradio: To create a user-friendly interface for uploading audio files and displaying results.
- Librosa and Soundfile: For efficient audio processing and chunk management.
The combination of these tools enables developers to build robust and scalable AI-driven applications with minimal effort.
How the App Works
- Audio Upload: Users upload their audio file via a simple web interface. The app also supports the ad hoc creation of an audio file to process using the out-of-the-box Gradio tools.
- Chunk Processing: The app splits the audio into 30-second chunks and transcribes each using Whisper.
- Summary Generation: A concise summary of the transcription is created using the summarisation pipeline from the Hugging Face Transformers library.
- Results Display: The full transcription and summary are displayed side-by-side and available to be copied and pasted for use outside the app.
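The flow above could be wired together roughly as follows. This is a hedged sketch under several assumptions, not the code behind the app: the names transcribe_and_summarise and build_demo are hypothetical, split_audio stands for any callable that turns an audio path into a list of chunk file paths, the Whisper checkpoint is assumed to be openai/whisper-base, and the summariser falls back to whichever default model the installed Transformers version provides.

```python
def transcribe_and_summarise(chunk_paths, transcribe, summarise):
    """Transcribe chunk files in order, merge the text, then summarise it."""
    texts = (transcribe(path).strip() for path in chunk_paths)
    transcript = " ".join(text for text in texts if text)
    summary = summarise(transcript) if transcript else ""
    return transcript, summary


def build_demo(split_audio):
    """Wire the steps into a Gradio interface.

    split_audio is a hypothetical callable: audio path -> list of chunk paths.
    """
    # Imported here so the helper above can be used without these libraries.
    import gradio as gr
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    summariser = pipeline("summarization")  # library default model (assumption)

    def transcribe(path):
        return asr(path)["text"]

    def summarise(text):
        # truncation=True clips transcripts longer than the model's input limit
        return summariser(text, truncation=True)[0]["summary_text"]

    def handle(audio_path):
        chunk_paths = split_audio(audio_path)
        return transcribe_and_summarise(chunk_paths, transcribe, summarise)

    return gr.Interface(
        fn=handle,
        inputs=gr.Audio(type="filepath", label="Audio"),
        outputs=[gr.Textbox(label="Transcription"), gr.Textbox(label="Summary")],
    )
```

Calling launch() on the returned interface would start the web UI. Keeping the merge logic in a plain function that takes the transcriber and summariser as arguments means it can be checked with stand-in functions before any model is downloaded.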
Conclusion
This app demonstrates how to overcome the limitations of certain publicly hosted speech recognition tools by chunking audio files and leveraging the flexibility of open-source libraries. By integrating cutting-edge models like Whisper with accessible tools like Gradio, hosted using Hugging Face's flexible compute resources, we’ve created a powerful solution for transcription and summarisation of longer-form audio files.
Try it here: https://huggingface.co/spaces/ZennyKenny/AudioTranscribe