ViViT (Video Vision Transformer)

ViViT model as introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository.

Disclaimer: The team releasing ViViT did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

ViViT is an extension of the Vision Transformer (ViT) to video: the input video is mapped to a sequence of spatio-temporal tokens, which are then encoded by a stack of transformer layers.

We refer to the paper for details.

Intended uses & limitations

The model is mostly intended to be fine-tuned on a downstream task, such as video classification. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

For code examples, we refer to the documentation.

BibTeX entry and citation info

@misc{arnab2021vivit,
      title={ViViT: A Video Vision Transformer}, 
      author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
      year={2021},
      eprint={2103.15691},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}