---
library_name: transformers
tags: []
---

# Huggingface Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model for audio-visual speech recognition that leverages both the audio and visual modalities to achieve robust performance, especially in noisy environments.

Key features of this repository include:

- Pre-trained models: AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained models have been exported from the [MuAViC](https://github.com/facebookresearch/muavic) repository.
- Inference scripts: easy-to-use inference pipelines built on Huggingface's interface.
- Data preprocessing scripts: utilities to normalize the video frame rate and to extract the lip region and audio track.

### Inference code

```sh
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
```

```python
import os

import torch
from transformers import Speech2TextTokenizer

from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature

if __name__ == "__main__":
    # Choose the language to run the example with
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"

    # Load model and tokenizer
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    model = model.cuda().eval()

    # Load the example video and audio; fall back to English if no example exists for the chosen language
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"

    # Load and preprocess the example into model inputs
    sample = load_feature(
        video_example,
        audio_example
    )

    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # All-False mask: nothing is padded in this single-sample batch
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()

    # Generate the transcript
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

### Data preprocessing scripts

```sh
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# the raw video currently only supports a 4:3 aspect ratio
cp raw_video.mp4 ./example/
python src/dataset/video_to_audio_lips.py
```

An illustrative sketch of the lip-extraction step is given after the checkpoint table below.

### Pretrained AVSR model

The following fine-tuned AVSR checkpoints are available:
| Languages    | Huggingface |
|--------------|-------------|
| Arabic       | [Checkpoint-AR](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-ar) |
| German       | [Checkpoint-DE](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-de) |
| Greek        | [Checkpoint-EL](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-el) |
| English      | [Checkpoint-EN](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-en) |
| Spanish      | [Checkpoint-ES](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-es) |
| French       | [Checkpoint-FR](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-fr) |
| Italian      | [Checkpoint-IT](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-it) |
| Portuguese   | [Checkpoint-PT](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-pt) |
| Russian      | [Checkpoint-RU](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-ru) |
| Multilingual | [Checkpoint-ar_de_el_es_fr_it_pt_ru](https://huggingface.co/nguyenvulebinh/AV-HuBERT-MuAViC-multilingual) |
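Every checkpoint follows the repository naming pattern used in the inference example above. As a minimal loading sketch, where `language` is any of the ids listed in `AVAILABLE_LANGUAGES`:

```python
from transformers import Speech2TextTokenizer

from src.model.avhubert2text import AV2TextForConditionalGeneration

# Any id from the inference example: ar, de, el, en, es, fr, it, pt, ru, multilingual
language = "fr"
model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"

model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir="./model-bin")
tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir="./model-bin")
```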
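For reference, here is a rough, hypothetical sketch of the lip-extraction idea behind the files downloaded in the preprocessing step (`shape_predictor_68_face_landmarks.dat` is dlib's 68-point facial landmark model). The mouth landmark range (48–67) and the 96×96 crop size are illustrative assumptions, not the script's exact parameters; `src/dataset/video_to_audio_lips.py` remains the reference implementation:

```python
import cv2
import dlib
import numpy as np

# dlib face detector and the 68-point landmark predictor downloaded above
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("model-bin/shape_predictor_68_face_landmarks.dat")

def extract_lip_roi(frame, size=96):
    """Crop a square region centered on the mouth (landmarks 48-67 in the 68-point scheme)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    mouth = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)
    # Clamp the crop to the frame boundaries
    x0, y0 = max(cx - size // 2, 0), max(cy - size // 2, 0)
    return frame[y0:y0 + size, x0:x0 + size]

# Example: extract the lip region from the first frame of a raw video
cap = cv2.VideoCapture("example/raw_video.mp4")
ok, frame = cap.read()
if ok:
    roi = extract_lip_roi(frame)
cap.release()
```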
## Acknowledgments

**AV-HuBERT**: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.

**MuAViC Repository**: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project.

## License

CC-BY-NC 4.0

## Citation

```bibtex
@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}
```