---
library_name: transformers
tags: []
---
# Huggingface Implementation of AV-HuBERT on the MuAViC Dataset
This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.
Key features of this repository include:
- Pre-trained models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained models have been exported from the [MuAViC](https://github.com/facebookresearch/muavic) repository.
- Inference scripts: Easy-to-use pipelines built on Huggingface's interface.
- Data preprocessing scripts: including frame-rate normalization, lip-region extraction, and audio extraction.
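The repository's own preprocessing lives in `src/dataset/video_to_audio_lips.py` (see below); as a rough illustration of the first two steps, here is a minimal sketch that only *builds* ffmpeg command lines for frame-rate normalization and audio extraction. The 25 fps and 16 kHz mono values are assumptions (typical AV-HuBERT settings), not taken from this repository.

```python
import shlex

def ffmpeg_normalize_fps_cmd(src: str, dst: str, fps: int = 25) -> list:
    """Build an ffmpeg command that resamples a video to a fixed frame rate.

    The 25 fps default is an assumption, not a value from this repo.
    """
    return ["ffmpeg", "-y", "-i", src, "-filter:v", f"fps={fps}", dst]

def ffmpeg_extract_audio_cmd(src: str, dst: str, sample_rate: int = 16000) -> list:
    """Build an ffmpeg command that extracts mono audio from a video.

    The 16 kHz mono default is an assumption, not a value from this repo.
    """
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", str(sample_rate), dst]

if __name__ == "__main__":
    # Print the commands instead of running them; pass the lists to
    # subprocess.run(...) to actually execute ffmpeg.
    print(shlex.join(ffmpeg_normalize_fps_cmd("raw_video.mp4", "video_25fps.mp4")))
    print(shlex.join(ffmpeg_extract_audio_cmd("raw_video.mp4", "audio.wav")))
```

Lip-region extraction additionally needs the face-landmark assets downloaded in the preprocessing section below and is handled by the repository's script.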
### Inference code
```sh
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
```
```python
import os
import torch
from transformers import Speech2TextTokenizer
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature

if __name__ == "__main__":
    # Choose a language to run the example
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"

    # Load model and tokenizer
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    model = model.cuda().eval()

    # Load the example video and audio
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"

    # Load and preprocess the example
    sample = load_feature(
        video_example,
        audio_example
    )
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # All-False boolean mask: no positions are masked out (single, unpadded sample)
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()

    # Generate text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )
    print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
### Data preprocessing scripts
```sh
mkdir model-bin
cd model-bin
wget https://huggingface.co./nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co./nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
# raw video currently only supports a 4:3 aspect ratio
cp raw_video.mp4 ./example/
python src/dataset/video_to_audio_lips.py
```
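Since the script currently only supports 4:3 raw video, it may help to check the aspect ratio before copying a file into `./example/`. This is a small sketch, not part of the repository; the width and height would come from your own probe of the file (e.g. via ffprobe or OpenCV).

```python
from math import gcd

def is_4_3(width: int, height: int) -> bool:
    """Return True if the frame size reduces to a 4:3 aspect ratio."""
    d = gcd(width, height)
    return (width // d, height // d) == (4, 3)

if __name__ == "__main__":
    print(is_4_3(640, 480))    # True: 640x480 reduces to 4:3
    print(is_4_3(1920, 1080))  # False: 16:9
```

Videos that fail the check would need to be cropped or padded to 4:3 before preprocessing.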
### Pretrained AVSR model
<table align="center">
<tr>
<th>Languages</th>
<th>Huggingface</th>
</tr>
<tr>
<td>Arabic</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-ar">Checkpoint-AR</a></td>
</tr>
<tr>
<td>German</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-de">Checkpoint-DE</a></td>
</tr>
<tr>
<td>Greek</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-el">Checkpoint-EL</a></td>
</tr>
<tr>
<td>English</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-en">Checkpoint-EN</a></td>
</tr>
<tr>
<td>Spanish</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-es">Checkpoint-ES</a></td>
</tr>
<tr>
<td>French</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-fr">Checkpoint-FR</a></td>
</tr>
<tr>
<td>Italian</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-it">Checkpoint-IT</a></td>
</tr>
<tr>
<td>Portuguese</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-pt">Checkpoint-PT</a></td>
</tr>
<tr>
<td>Russian</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-ru">Checkpoint-RU</a></td>
</tr>
<tr>
<td>Multilingual</td>
<td><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-multilingual">Checkpoint-ar_de_el_es_fr_it_pt_ru</a></td>
</tr>
</table>
## Acknowledgments
**AV-HuBERT**: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.
**MuAViC Repository**: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project.
## License
CC-BY-NC 4.0
## Citation
```bibtex
@article{anwar2023muavic,
title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
journal={arXiv preprint arXiv:2303.00628},
year={2023}
}
```