Model card for vit_base_patch16_1024_128.audiomae_as2m
A Vision Transformer (ViT) for audio, pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method.
- This is a port of AudioMAE ViT-B/16 weights for usage with timm. The naming convention is adopted from other timm ViT models.
- See the original repo here: https://github.com/facebookresearch/AudioMAE
- For the AudioSet-20k fine-tuned checkpoint, see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k
NOTE: this model does not have a classification head.
Model Details
- Model Type: Audio feature backbone
- Papers:
- Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
- Pretrain Dataset: AudioSet-2M
- Original: https://github.com/facebookresearch/AudioMAE
Model Usage
Audio Embeddings
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi
# for fine-tuning, you can pass `num_classes={your number of classes}`
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m", pretrained=True)
model = model.eval()
MEAN = -4.2677393
STD = 4.5689974
audio = torch.randn(1, 10 * 16_000) # make sure input is 16kHz
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128) # shape (n_frames, 128)
# AudioMAE only accepts 1024-frame input
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)
melspec = melspec.view(1, 1, 1024, 128) # add batch dim and channel dim
output = model(melspec) # embeddings with shape (1, 768)
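The padding step above exists because a 10-second clip does not produce exactly 1024 frames. The sketch below works through that arithmetic in plain Python. It assumes the Kaldi-style fbank defaults (25 ms window, 10 ms shift, only full windows kept) and reads the patch size of 16 from the model name; the helper names `num_fbank_frames` and `num_patches` are hypothetical and not part of timm or torchaudio.

```python
# Assumption: Kaldi-style fbank defaults of a 25 ms window and a 10 ms shift,
# keeping only full windows (snip_edges=True).
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE * 25 // 1000  # 400 samples per frame
SHIFT = SAMPLE_RATE * 10 // 1000   # 160 samples between frame starts


def num_fbank_frames(num_samples: int) -> int:
    """Number of frames when only complete windows are kept."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // SHIFT


def num_patches(frames: int = 1024, mel_bins: int = 128, patch: int = 16) -> int:
    """Number of non-overlapping patch x patch tiles over a (frames, mel_bins) input."""
    return (frames // patch) * (mel_bins // patch)


frames = num_fbank_frames(10 * SAMPLE_RATE)  # frames for a 10-second clip
pad = 1024 - frames                          # zero frames appended by F.pad
print(frames, pad, num_patches())
```

Under these assumptions a 10-second clip yields 998 frames, so 26 zero frames are appended, and the 1024 x 128 input is split into 512 patches of 16 x 16.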
Citation
@inproceedings{huang2022amae,
title = {Masked Autoencoders that Listen},
author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
booktitle = {NeurIPS},
year = {2022}
}
@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}