# Audio Transcription App
This application uses Hugging Face's `facebook/wav2vec2-large-xlsr-53-spanish` model to transcribe Spanish audio files. It provides a simple web interface where you can upload a recording, such as a meeting, and receive its full transcription.

---
## Features
- **Automatic Speech Recognition (ASR):** Uses a pre-trained wav2vec2 model from Hugging Face, fine-tuned for Spanish, to transcribe speech.
- **Supports Long Audio:** Automatically splits long recordings into shorter chunks (30 seconds by default) and transcribes each one.
- **Web Interface:** Provides a user-friendly interface using Gradio.
- **Flexible Audio Upload:** Accepts common audio formats like WAV and MP3.
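
The README does not show `app.py`, so the following is a hypothetical sketch of how a single chunk could be transcribed with this model, using the standard `Wav2Vec2Processor` and `Wav2Vec2ForCTC` classes from `transformers` and assuming they load for this checkpoint. It downmixes each chunk to mono, resamples it to 16 kHz (the rate wav2vec2 XLSR models are trained on), scales the integer samples to floats, and picks the most likely token at each frame (greedy CTC decoding). The names `pcm_to_float`, `load_model`, and `transcribe_chunk` are illustrative; the model is cached so it is loaded once rather than once per chunk.

```python
from functools import lru_cache
from typing import List, Sequence

# Hypothetical sketch: helper names are illustrative, not taken from app.py.
MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"
TARGET_SR = 16_000  # wav2vec2 XLSR checkpoints are trained on 16 kHz audio


def pcm_to_float(samples: Sequence[int], sample_width: int) -> List[float]:
    """Scale signed integer PCM samples (sample_width bytes each) to floats in [-1.0, 1.0)."""
    scale = float(1 << (8 * sample_width - 1))
    return [s / scale for s in samples]


@lru_cache(maxsize=1)
def load_model():
    """Load the processor and CTC model once and reuse them for every chunk."""
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor  # requires transformers + torch

    return Wav2Vec2Processor.from_pretrained(MODEL_ID), Wav2Vec2ForCTC.from_pretrained(MODEL_ID)


def transcribe_chunk(path: str) -> str:
    """Transcribe one audio chunk with greedy (argmax) CTC decoding."""
    import torch
    from pydub import AudioSegment  # requires pydub + FFmpeg

    # Downmix to mono and resample to the rate the model expects.
    audio = AudioSegment.from_file(path).set_channels(1).set_frame_rate(TARGET_SR)
    speech = pcm_to_float(audio.get_array_of_samples(), audio.sample_width)

    processor, model = load_model()
    inputs = processor(speech, sampling_rate=TARGET_SR, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
    return processor.batch_decode(predicted_ids)[0]  # collapses repeats and blanks into text
```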
---
## Installation
### 1. Clone the Repository
```bash
git clone <repository_url>
cd <repository_folder>
```
### 2. Install Dependencies
Ensure you have Python 3.7 or higher installed. Then, run:
```bash
pip install -r requirements.txt
```
### 3. Install FFmpeg
This application uses `pydub`, which requires FFmpeg. Follow these steps to install it:
- **Windows:** Download a build from [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html), extract it, and add its `bin` directory to your `PATH`.
- **macOS:** Use Homebrew:
```bash
brew install ffmpeg
```
- **Linux:** Install it with your distribution's package manager, e.g. on Debian or Ubuntu:
```bash
sudo apt install ffmpeg
```
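To confirm that the installation worked, a small check like the following (a hypothetical helper, not part of this repository) looks for an `ffmpeg` executable on your `PATH` and reports its version line:

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if no ffmpeg executable is on PATH."""
    exe = shutil.which("ffmpeg")  # same PATH lookup a shell would perform
    if exe is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version or "FFmpeg not found on PATH; pydub needs it to decode formats such as MP3.")
```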
---
## Usage
### 1. Run the Application
Start the app by running:
```bash
python app.py
```
### 2. Open the Web Interface
Once the app starts, it will provide a local URL (e.g., `http://127.0.0.1:7860/`). Open this URL in your web browser.
### 3. Upload an Audio File
- Click on the upload button to select an audio file.
- Supported formats: WAV, MP3, etc.
### 4. Get the Transcription
- Once the audio is processed, the transcription will appear in the text box.
- You can copy the transcription for further use.
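
For orientation, one way such an upload-and-transcribe interface can be built with Gradio is sketched below. This is an assumption about how `app.py` might be wired, not its actual code: `make_handler` and `launch` are hypothetical names, and `transcribe_file` stands for any function that maps an audio file path to text.

```python
from typing import Callable, Optional

# Hypothetical helper names; app.py may wire Gradio differently.


def make_handler(transcribe_file: Callable[[str], str]) -> Callable[[Optional[str]], str]:
    """Wrap a path-based transcriber so that submitting without a file returns a message instead of raising."""

    def handler(audio_path: Optional[str]) -> str:
        if not audio_path:  # gr.Audio(type="filepath") passes None when nothing was uploaded
            return "Please upload an audio file."
        return transcribe_file(audio_path)

    return handler


def launch(transcribe_file: Callable[[str], str]) -> None:
    import gradio as gr  # imported lazily; requires the gradio package

    demo = gr.Interface(
        fn=make_handler(transcribe_file),
        inputs=gr.Audio(type="filepath"),  # the uploaded file reaches fn as a local path
        outputs=gr.Textbox(label="Transcription"),
    )
    demo.launch()  # by default serves on http://127.0.0.1:7860/
```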
---
## File Structure
- `app.py`: Main application script.
- `requirements.txt`: List of dependencies.
- `chunks/`: Temporary folder where audio chunks are stored during processing.
- `transcripcion.txt`: File where the full transcription is saved after processing.
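
A hypothetical sketch of how these two artifacts could be produced with `pydub` (chunk export via `pydub.utils.make_chunks`) and plain file I/O. The `chunk_0000.wav` naming and joining chunk texts with single spaces are assumptions, not necessarily what `app.py` does:

```python
import os
from typing import List


def export_chunks(audio_path: str, chunk_length_ms: int = 30000, out_dir: str = "chunks") -> List[str]:
    """Split an audio file into fixed-length WAV chunks inside out_dir (hypothetical helper)."""
    # Imported lazily so the rest of this sketch runs without pydub; requires pydub + FFmpeg.
    from pydub import AudioSegment
    from pydub.utils import make_chunks

    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(audio_path)
    paths = []
    for i, chunk in enumerate(make_chunks(audio, chunk_length_ms)):
        chunk_path = os.path.join(out_dir, f"chunk_{i:04d}.wav")
        chunk.export(chunk_path, format="wav")
        paths.append(chunk_path)
    return paths


def save_transcription(chunk_texts: List[str], out_path: str = "transcripcion.txt") -> str:
    """Join per-chunk transcriptions in order, skipping empty ones, and write them to out_path."""
    full_text = " ".join(t.strip() for t in chunk_texts if t.strip())
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(full_text)
    return full_text
```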
---
## Customization
### Adjust Chunk Length
The default chunk length is 30 seconds (30,000 ms). To change it, edit the `chunk_length_ms` value in `app.py`:
```python
chunk_length_ms = 30000  # change to the desired length in milliseconds
```
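To see how this setting affects processing, the hypothetical helper below computes the chunk boundaries for a recording of a given length. At the default of 30,000 ms, a one-hour meeting (3,600,000 ms) becomes 120 chunks, each transcribed separately, and the last chunk is simply shorter when the length does not divide evenly. As a general trade-off, longer chunks need more memory per model pass, while shorter chunks cut more words at chunk boundaries.

```python
import math
from typing import List, Tuple


def chunk_bounds(total_ms: int, chunk_length_ms: int = 30000) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs covering [0, total_ms); the last chunk may be shorter."""
    if chunk_length_ms <= 0:
        raise ValueError("chunk_length_ms must be positive")
    n_chunks = math.ceil(total_ms / chunk_length_ms)
    return [
        (i * chunk_length_ms, min((i + 1) * chunk_length_ms, total_ms))
        for i in range(n_chunks)
    ]
```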
---
## Limitations
- **Language:** The model is fine-tuned for Spanish. Audio in other languages will generally produce inaccurate transcriptions.
- **Audio Quality:** Poor-quality audio may result in less accurate transcriptions.
- **Performance:** Processing time grows with audio length, because every chunk requires a separate model pass; actual speed depends on your hardware.
---
## Dependencies
- `transformers`: Loads and runs the ASR model.
- `torch`: Backend for model computations.
- `pydub`: For audio splitting.
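- `ffmpeg`: Required by `pydub` for decoding and converting audio. This is a system program, not a Python package; install it separately as described in the Installation section.
- `gradio`: To create the web interface.
Install the Python packages using: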
```bash
pip install -r requirements.txt
```
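After installing, you can verify that the Python packages are importable with a check like the following (a hypothetical helper; it uses `importlib.util.find_spec`, which locates a module without importing it). FFmpeg is a system program, so it is not covered here:

```python
import importlib.util
from typing import Iterable, List

# Import names of the Python packages listed above (hypothetical helper).
# FFmpeg is a system binary, so it cannot be detected this way.
REQUIRED_MODULES = ("transformers", "torch", "pydub", "gradio")


def missing_modules(modules: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the modules that cannot be found in the current environment, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing: " + ", ".join(missing) if missing else "All Python dependencies are installed.")
```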
---
## License
This project is licensed under the MIT License. See the LICENSE file for details.

---
## Acknowledgments
- Hugging Face for providing pre-trained models.
- Gradio for the simple interface framework.
- Pydub and FFmpeg for audio processing tools.