SIGMA / MODEL_ZOO.md
Mohammadreza Salehidehnavi

VideoMAE Model Zoo

Kinetics-400

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-S | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 800 | 16x5x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 80.0 | 94.4 |
| VideoMAE | no | ViT-B | 800 | 16x5x3 | same as above | TODO | 81.0 | 94.8 |
| VideoMAE | no | ViT-B | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 86.6 | 97.1 |

Something-Something V2

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-S | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 800 | 16x2x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 69.6 | 92.0 |
| VideoMAE | no | ViT-B | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 70.8 | 92.4 |

UCF101

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 3200 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 91.3 | 98.5 |

Note:

  • We report results for VideoMAE fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
  • #Frame = #input_frame x #clip x #crop.
  • #input_frame is the number of frames fed to the model per clip during the test phase.
  • #crop means spatial crops (e.g., 3 for left/right/center crop).
  • #clip means temporal clips (e.g., 5 means sampling five clips from the video, each with a different start index).
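
To make the #Frame notation above concrete, here is a minimal Python sketch that parses a value such as `16x5x3` and derives the number of test-time views (#clip x #crop) and the total number of frames processed per video. The names `FrameSpec`, `parse_frame_spec`, and `clip_start_indices` are hypothetical helpers written for this illustration and are not part of the repository. The evenly spaced start indices are only one plausible way to place temporal clips, and the clip span of 64 raw frames used in the example is an assumption, not a value taken from the fine-tuning scripts.

```python
from typing import List, NamedTuple


class FrameSpec(NamedTuple):
    """Parsed '#Frame' entry (hypothetical helper, not part of the repo)."""
    input_frames: int  # frames per clip fed to the model
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Number of forward passes per video at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # #Frame = #input_frame x #clip x #crop
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string such as '16x5x3' into its three factors."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


def clip_start_indices(num_video_frames: int, clip_span: int, num_clips: int) -> List[int]:
    """Evenly spaced start indices for `num_clips` temporal clips.

    Assumption: clips are spread uniformly over the video; the actual
    test-time sampler may place them differently.
    """
    last_start = max(num_video_frames - clip_span, 0)
    if num_clips == 1:
        return [last_start // 2]  # single clip centred in the video
    return [i * last_start // (num_clips - 1) for i in range(num_clips)]


if __name__ == "__main__":
    k400 = parse_frame_spec("16x5x3")
    print(k400.views, k400.total_frames)
    # 300-frame video, assumed clip span of 64 raw frames, 5 clips
    print(clip_start_indices(300, 64, 5))
```

Under this reading, the Kinetics-400 setting `16x5x3` averages predictions over 15 views per video, while the Something-Something V2 setting `16x2x3` uses 6 views.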