Audio Spectrogram Transformer (fine-tuned on Speech Commands v2)

Audio Spectrogram Transformer (AST) model fine-tuned on Speech Commands v2. It was introduced in the paper AST: Audio Spectrogram Transformer by Gong et al. and first released in this repository.

Disclaimer: The team releasing Audio Spectrogram Transformer did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

The Audio Spectrogram Transformer is equivalent to a Vision Transformer (ViT), but applied to audio. Audio is first converted into an image (a spectrogram), after which a Vision Transformer is applied. The model achieves state-of-the-art results on several audio classification benchmarks.

Usage

You can use the raw model for classifying audio into one of the Speech Commands v2 classes. See the documentation for more info.
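As a sketch of that usage, the checkpoint can be loaded into the `transformers` audio-classification pipeline (assuming `transformers` and `torch` are installed). The silent waveform below is a placeholder; in practice you would pass a path to an audio file or a waveform resampled to 16 kHz.

```python
import numpy as np
from transformers import pipeline

# Load the fine-tuned checkpoint into an audio-classification pipeline.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-speech-commands-v2",
)

# One second of silence at 16 kHz as a stand-in for a real recording;
# replace with your own waveform or a file path like "audio.wav".
waveform = np.zeros(16000, dtype=np.float32)

# Returns the top predicted Speech Commands classes with scores.
predictions = classifier(waveform)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```

The pipeline handles feature extraction and softmax scoring internally, so the output is a list of `{"label", "score"}` dictionaries sorted by confidence.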

Model size

85.4M parameters (F32 tensors, available in Safetensors format).


This model was fine-tuned on the Speech Commands v2 dataset.

