Update README.md

48b6e7f verified 4 days ago

5.77 kB

	---
	library_name: transformers
	tags: []
	---

	# Huggingface Implementation of AV-HuBERT on the MuAViC Dataset

	This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.


	Key features of this repository include:

	- Pre-trained Models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained model been exported from [MuAViC](https://github.com/facebookresearch/muavic) repository.

	- Inference scripts: Easily pipelines using Huggingface’s interface.

	- Data preprocessing scripts: Including normalize frame rate, extract lips and audio.

	### Inference code

	```sh
	git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
	cd AV-HuBERT-S2S
	conda create -n avhuberts2s python=3.9
	conda activate avhuberts2s
	pip install -r requirements.txt
	python run_example.py
	```

	```python
	from src.model.avhubert2text import AV2TextForConditionalGeneration
	from src.dataset.load_data import load_feature
	from transformers import Speech2TextTokenizer
	import torch

	if __name__ == "__main__":
	# Choose language to run example
	AVAILABEL_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
	language = "ru"
	assert language in AVAILABEL_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABEL_LANGUAGES}"


	# Load model and tokenizer
	model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
	model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
	tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')

	model = model.cuda().eval()

	# Load example video and audio
	video_example = f"./example/video_processed/{language}_lip_movement.mp4"
	audio_example = f"./example/video_processed/{language}_audio.wav"
	if not os.path.exists(video_example) or not os.path.exists(audio_example):
	print(f"WARNING: Example video and audio for {language} is not available english will be used instead")
	video_example = f"./example/video_processed/en_lip_movement.mp4"
	audio_example = f"./example/video_processed/en_audio.wav"

	# Load and process example
	sample = load_feature(
	video_example,
	audio_example
	)

	audio_feats = sample['audio_source'].cuda()
	video_feats = sample['video_source'].cuda()
	attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()

	# Generate text
	output = model.generate(
	audio_feats,
	attention_mask=attention_mask,
	video=video_feats,
	max_length=1024,
	)

	print(tokenizer.batch_decode(output, skip_special_tokens=True))
	```

	### Data preprocessing scripts

	```sh
	mkdir model-bin
	cd model-bin
	wget https://huggingface.co./nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy .
	wget https://huggingface.co./nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat .

	# raw video only support 4:3 ratio now
	cp raw_video.mp4 ./example/

	python src/dataset/video_to_audio_lips.py
	```

	### Pretrained AVSR model

	<table align="center">
	<tr>
	<th>Languages</th>
	<th>Huggingface</th>
	</tr>
	<tr>
	<th>Arabic</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-ar">Checkpoint-AR</a></th>
	</tr>
	<tr>
	<th>German</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-de">Checkpoint-DE</a></th>
	</tr>
	<tr>
	<th>Greek</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-el">Checkpoint-EL</a></th>
	</tr>
	<tr>
	<th>English</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-en">Checkpoint-EN</a></th>
	</tr>
	<tr>
	<th>Spanish</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-es">Checkpoint-ES</a></th>
	</tr>
	<tr>
	<th>French</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-fr">Checkpoint-FR</a></th>
	</tr>
	<tr>
	<th>Italian</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-it">Checkpoint-IT</a></th>
	</tr>
	<tr>
	<th>Portuguese</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-pt">Checkpoint-PT</a></th>
	</tr>
	<tr>
	<th>Russian</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-ru">Checkpoint-RU</a></th>
	</tr>
	<tr>
	<th>Multilingual</th>
	<th><a href="https://huggingface.co./nguyenvulebinh/AV-HuBERT-MuAViC-multilingual">Checkpoint-ar_de_el_es_fr_it_pt_ru</a></th>
	</tr>
	</table>

	## Acknowledgments

	AV-HuBERT: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.

	MuAViC Repository: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project

	## License

	CC-BY-NC 4.0

	## Citation

	```bibtex
	@article{anwar2023muavic,
	title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
	author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
	journal={arXiv preprint arXiv:2303.00628},
	year={2023}
	}
	```