
Audio Transcription App

This application uses the facebook/wav2vec2-large-xlsr-53-spanish model, hosted on the Hugging Face Hub, to transcribe Spanish audio files. It provides a simple web interface where users can upload recordings, such as meeting audio, and receive a full transcription.
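As a rough illustration of the transcription step, the sketch below uses the transformers `pipeline` helper with the model named above. The function names `load_asr` and `transcribe_file` are hypothetical and are not taken from app.py, which may load and call the model differently.

```python
MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"


def load_asr(model_id=MODEL_ID):
    """Build an ASR pipeline (the model is downloaded on first use)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe_file(asr, audio_path):
    """Run the pipeline on one audio file and return the recognized text."""
    result = asr(audio_path)  # the ASR pipeline returns a dict with a "text" key
    return result["text"].strip()
```

Passing the pipeline in as an argument keeps the model loaded once and reused across files, e.g. `asr = load_asr()` followed by `transcribe_file(asr, "meeting.wav")` (the file name here is only an example).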


Features

  • Automatic Speech Recognition (ASR): Uses a pre-trained wav2vec2 model from the Hugging Face Hub for Spanish transcription.
  • Supports Long Audio: Automatically splits long audio files into smaller chunks for processing.
  • Web Interface: Provides a user-friendly interface using Gradio.
  • Flexible Audio Upload: Accepts common audio formats like WAV and MP3.

Installation

1. Clone the Repository

git clone <repository_url>
cd <repository_folder>

2. Install Dependencies

Ensure you have Python 3.7 or higher installed. Then, run:

pip install -r requirements.txt

3. Install FFmpeg

This application uses pydub, which requires FFmpeg. Follow these steps to install it:

  • Windows: Download FFmpeg from https://ffmpeg.org/download.html. Add the bin directory to your PATH.
  • macOS: Use Homebrew:
    brew install ffmpeg
    
  • Linux: Install via your package manager, e.g.,
    sudo apt install ffmpeg
    
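To confirm that FFmpeg is reachable after installation, a small check like the following can help. It is only a sketch using the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of app.py.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("FFmpeg found" if ffmpeg_available() else "FFmpeg not found on PATH")
```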

Usage

1. Run the Application

Start the app by running:

python app.py

2. Open the Web Interface

Once the app starts, it will provide a local URL (e.g., http://127.0.0.1:7860/). Open this URL in your web browser.

3. Upload an Audio File

  • Click on the upload button to select an audio file.
  • Supported formats: WAV, MP3, and other common formats that FFmpeg can decode.

4. Get the Transcription

  • Once the audio is processed, the transcription will appear in the text box.
  • You can copy the transcription for further use.
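The upload-and-transcribe flow above could be wired up in Gradio roughly as follows. This is a minimal sketch, not the actual contents of app.py: `handle_upload`, `build_demo`, and the `transcribe_fn` argument are hypothetical names, and it assumes the audio component is configured to pass a file path.

```python
def handle_upload(audio_path, transcribe_fn):
    """Return a transcription, or a hint if no file was provided."""
    # Gradio passes None when the user submits without uploading a file.
    if audio_path is None:
        return "Please upload an audio file."
    return transcribe_fn(audio_path)


def build_demo(transcribe_fn):
    """Create a Gradio interface with one audio input and one text output."""
    # Imported lazily so the helper above can be used without gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=lambda path: handle_upload(path, transcribe_fn),
        inputs=gr.Audio(type="filepath", label="Audio file"),
        outputs=gr.Textbox(label="Transcription"),
        title="Audio Transcription App",
    )
```

Calling `build_demo(my_transcribe).launch()` would then start the local server and print the URL to open in the browser.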

File Structure

  • app.py: Main application script.
  • requirements.txt: List of dependencies.
  • chunks/: Temporary folder where audio chunks are stored during processing.
  • transcripcion.txt: File where the full transcription is saved after processing.
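For illustration, the last two entries could be handled along these lines: per-chunk texts are joined and written to transcripcion.txt, and the temporary chunks/ folder is removed afterwards. The function names are hypothetical, and joining with a single space is an assumption; app.py may do this differently.

```python
import shutil
from pathlib import Path


def save_transcription(chunk_texts, out_path="transcripcion.txt"):
    """Join non-empty per-chunk texts with spaces and write them as UTF-8."""
    full_text = " ".join(t.strip() for t in chunk_texts if t.strip())
    Path(out_path).write_text(full_text, encoding="utf-8")
    return full_text


def clear_chunks(chunks_dir="chunks"):
    """Delete the temporary chunk folder; do nothing if it does not exist."""
    shutil.rmtree(chunks_dir, ignore_errors=True)
```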

Customization

Adjust Chunk Length

The default chunk length is set to 30 seconds. You can adjust this by modifying the chunk_length_ms parameter in the app.py file:

chunk_length_ms = 30000  # Change to desired length in milliseconds
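To see how the chunk length affects the number of pieces, the sketch below computes the millisecond boundaries for a recording of a given duration. The helper `chunk_bounds` is hypothetical and not taken from app.py; it only illustrates fixed-length splitting without overlap.

```python
def chunk_bounds(total_ms, chunk_length_ms=30000):
    """Return (start, end) millisecond pairs covering [0, total_ms)."""
    if chunk_length_ms <= 0:
        raise ValueError("chunk_length_ms must be positive")
    return [
        (start, min(start + chunk_length_ms, total_ms))
        for start in range(0, total_ms, chunk_length_ms)
    ]


# A 70-second recording with the default 30-second chunks:
print(chunk_bounds(70000))
# → [(0, 30000), (30000, 60000), (60000, 70000)]
```

With pydub, `len(audio)` gives the duration in milliseconds and `audio[start:end]` slices by milliseconds, so each pair maps directly to one chunk. Shorter chunks use less memory per model call but may cut more words at the boundaries.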

Limitations

  • Language: The model is optimized for Spanish audio. Performance may vary with other languages.
  • Audio Quality: Poor-quality audio may result in less accurate transcriptions.
  • Performance: Processing time grows with audio length and depends on your hardware, so very large files may take a while.

Dependencies

  • transformers: For ASR model.
  • torch: Backend for model computations.
  • pydub: For audio splitting.
  • ffmpeg: Required by pydub for audio processing.
  • gradio: To create the web interface.

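Install the Python packages with the command below. FFmpeg is a system tool, not a Python package, and must be installed separately as described under Installation.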

pip install -r requirements.txt

License

This project is licensed under the MIT License. See the LICENSE file for details.


Acknowledgments

  • Hugging Face for providing pre-trained models.
  • Gradio for the simple interface framework.
  • Pydub and FFmpeg for audio processing tools.