Model Zoo and Results

Environment and Settings

  • Training used 4 NVIDIA V100 GPUs; evaluation used 1.
  • Automatic mixed precision was enabled in training but disabled in evaluation.
  • Test-time augmentations were not used.
  • The inference resolution was 480p for DAVIS and 1.3x480p for YouTube-VOS, following CFBI.
  • Inference was fully online: each frame was passed through all the modules, one frame at a time.
  • FPS was recorded for multi-object inference rather than single-object inference.
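
As a quick illustration of the resolution setting above, the sketch below works out the arithmetic behind "1.3x480p". It assumes the notation means a 480-pixel short side scaled by 1.3; that reading, and the helper name `scaled_short_side`, are assumptions for illustration and are not taken from the evaluation code.

```python
def scaled_short_side(base: int = 480, scale: float = 1.0) -> int:
    """Return the short-side length in pixels after scaling a base resolution.

    Hypothetical helper: it only illustrates the arithmetic, assuming the
    scale factor applies to the short side of the frame.
    """
    return round(base * scale)


davis_short_side = scaled_short_side(480, 1.0)  # DAVIS: 480p
ytb_short_side = scaled_short_side(480, 1.3)    # YouTube-VOS: 1.3x480p
print(davis_short_side, ytb_short_side)
```

Under that assumption, YouTube-VOS frames are evaluated with a 624-pixel short side, compared with 480 for DAVIS.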

Pre-trained Models

Stages:

  • PRE: the pre-training stage with static images.

  • PRE_YTB_DAV: the main-training stage with YouTube-VOS and DAVIS. All evaluations share the same model and the same parameters.

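| Model | Param (M) | PRE | PRE_YTB_DAV |
| --- | --- | --- | --- |
| AOTT | 5.7 | gdrive | gdrive |
| AOTS | 7.0 | gdrive | gdrive |
| AOTB | 8.3 | gdrive | gdrive |
| AOTL | 8.3 | gdrive | gdrive |
| R50-AOTL | 14.9 | gdrive | gdrive |
| SwinB-AOTL | 65.4 | gdrive | gdrive |

| Model | Param (M) | PRE | PRE_YTB_DAV |
| --- | --- | --- | --- |
| DeAOTT | 7.2 | gdrive | gdrive |
| DeAOTS | 10.2 | gdrive | gdrive |
| DeAOTB | 13.2 | gdrive | gdrive |
| DeAOTL | 13.2 | gdrive | gdrive |
| R50-DeAOTL | 19.8 | gdrive | gdrive |
| SwinB-DeAOTL | 70.3 | gdrive | gdrive |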

To run inference with one of our pre-trained models, set --model and --ckpt_path to the downloaded checkpoint's model type and file path when running eval.py.
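
A minimal sketch of assembling that invocation is shown below. Only the two flags named above (`--model`, `--ckpt_path`) and the script name `eval.py` come from this page; the model type value `aott`, the checkpoint path, and the helper `build_eval_command` are hypothetical placeholders, and the real script may need further arguments.

```python
import shlex
from typing import List


def build_eval_command(model: str, ckpt_path: str, script: str = "eval.py") -> List[str]:
    """Assemble an argv list for the evaluation script.

    Hypothetical helper: it only wires up the two flags mentioned in the text
    and does not run anything.
    """
    return ["python", script, "--model", model, "--ckpt_path", ckpt_path]


# Placeholder values: substitute the model type and path of your own download.
cmd = build_eval_command("aott", "pretrain_models/AOTT_PRE_YTB_DAV.pth")
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run(cmd)`.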

YouTube-VOS 2018 val

All-F: all frames. YouTube-VOS is evaluated at 6fps by default, but the dataset organizers also supply the 30fps sequences (all the frames). We noticed that many VOS methods prefer to evaluate on the 30fps videos, so we report those results here as well. Denser video sequences can significantly improve VOS performance for models that use the memory reading strategy (AOTL, R50-AOTL, and SwinB-AOTL), but they reduce efficiency, since more memorized frames are stored for object matching.

| Model | Stage | FPS | All-F | Mean | J Seen | F Seen | J Unseen | F Unseen | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOTT | PRE_YTB_DAV | 41.0 | | 80.2 | 80.4 | 85.0 | 73.6 | 81.7 | gdrive |
| AOTT | PRE_YTB_DAV | 41.0 | √ | 80.9 | 80.0 | 84.7 | 75.2 | 83.5 | gdrive |
| DeAOTT | PRE_YTB_DAV | 53.4 | | 82.0 | 81.6 | 86.3 | 75.8 | 84.2 | - |
| AOTS | PRE_YTB_DAV | 27.1 | | 82.9 | 82.3 | 87.0 | 77.1 | 85.1 | gdrive |
| AOTS | PRE_YTB_DAV | 27.1 | √ | 83.0 | 82.2 | 87.0 | 77.3 | 85.7 | gdrive |
| DeAOTS | PRE_YTB_DAV | 38.7 | | 84.0 | 83.3 | 88.3 | 77.9 | 86.6 | - |
| AOTB | PRE_YTB_DAV | 20.5 | | 84.0 | 83.2 | 88.1 | 78.0 | 86.5 | gdrive |
| AOTB | PRE_YTB_DAV | 20.5 | √ | 84.1 | 83.6 | 88.5 | 78.0 | 86.5 | gdrive |
| DeAOTB | PRE_YTB_DAV | 30.4 | | 84.6 | 83.9 | 88.9 | 78.5 | 87.0 | - |
| AOTL | PRE_YTB_DAV | 16.0 | | 84.1 | 83.2 | 88.2 | 78.2 | 86.8 | gdrive |
| AOTL | PRE_YTB_DAV | 6.5 | √ | 84.5 | 83.7 | 88.8 | 78.4 | 87.1 | gdrive |
| DeAOTL | PRE_YTB_DAV | 24.7 | | 84.8 | 84.2 | 89.4 | 78.6 | 87.0 | - |
| R50-AOTL | PRE_YTB_DAV | 14.9 | | 84.6 | 83.7 | 88.5 | 78.8 | 87.3 | gdrive |
| R50-AOTL | PRE_YTB_DAV | 6.4 | √ | 85.5 | 84.5 | 89.5 | 79.6 | 88.2 | gdrive |
| R50-DeAOTL | PRE_YTB_DAV | 22.4 | | 86.0 | 84.9 | 89.9 | 80.4 | 88.7 | - |
| SwinB-AOTL | PRE_YTB_DAV | 9.3 | | 84.7 | 84.5 | 89.5 | 78.1 | 86.7 | gdrive |
| SwinB-AOTL | PRE_YTB_DAV | 5.2 | √ | 85.1 | 85.1 | 90.1 | 78.4 | 86.9 | gdrive |
| SwinB-DeAOTL | PRE_YTB_DAV | 11.9 | | 86.2 | 85.6 | 90.6 | 80.0 | 88.4 | - |
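
To make the All-F trade-off concrete, the sketch below compares each AOT model's default (6fps) row with its All-F row from the YouTube-VOS 2018 val results above. The FPS and Mean values are copied from those rows; the helper `all_f_tradeoff` and the choice of "slowdown factor" and "Mean gain" as summary numbers are illustrative assumptions, not part of the original evaluation.

```python
# (model, FPS default, FPS with All-F, Mean default, Mean with All-F),
# copied from the YouTube-VOS 2018 val rows.
rows = [
    ("AOTT", 41.0, 41.0, 80.2, 80.9),
    ("AOTS", 27.1, 27.1, 82.9, 83.0),
    ("AOTB", 20.5, 20.5, 84.0, 84.1),
    ("AOTL", 16.0, 6.5, 84.1, 84.5),
    ("R50-AOTL", 14.9, 6.4, 84.6, 85.5),
    ("SwinB-AOTL", 9.3, 5.2, 84.7, 85.1),
]


def all_f_tradeoff(fps_default, fps_all_f, mean_default, mean_all_f):
    """Return (slowdown factor, Mean gain) when switching to All-F evaluation."""
    slowdown = round(fps_default / fps_all_f, 2)
    gain = round(mean_all_f - mean_default, 1)
    return slowdown, gain


for model, fps_d, fps_a, mean_d, mean_a in rows:
    slowdown, gain = all_f_tradeoff(fps_d, fps_a, mean_d, mean_a)
    print(f"{model:<11} slowdown x{slowdown:.2f}  Mean gain {gain:+.1f}")
```

Only the three memory-reading models (AOTL, R50-AOTL, SwinB-AOTL) lose speed under All-F, which matches the note that more memorized frames are stored for object matching; the reported FPS of AOTT, AOTS, and AOTB is the same in both settings.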

YouTube-VOS 2019 val

| Model | Stage | FPS | All-F | Mean | J Seen | F Seen | J Unseen | F Unseen | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOTT | PRE_YTB_DAV | 41.0 | | 80.0 | 79.8 | 84.2 | 74.1 | 82.1 | gdrive |
| AOTT | PRE_YTB_DAV | 41.0 | √ | 80.9 | 79.9 | 84.4 | 75.6 | 83.8 | gdrive |
| DeAOTT | PRE_YTB_DAV | 53.4 | | 82.0 | 81.2 | 85.6 | 76.4 | 84.7 | - |
| AOTS | PRE_YTB_DAV | 27.1 | | 82.7 | 81.9 | 86.5 | 77.3 | 85.2 | gdrive |
| AOTS | PRE_YTB_DAV | 27.1 | √ | 82.8 | 81.9 | 86.5 | 77.3 | 85.6 | gdrive |
| DeAOTS | PRE_YTB_DAV | 38.7 | | 83.8 | 82.8 | 87.5 | 78.1 | 86.8 | - |
| AOTB | PRE_YTB_DAV | 20.5 | | 84.0 | 83.1 | 87.7 | 78.5 | 86.8 | gdrive |
| AOTB | PRE_YTB_DAV | 20.5 | √ | 84.1 | 83.3 | 88.0 | 78.2 | 86.7 | gdrive |
| DeAOTB | PRE_YTB_DAV | 30.4 | | 84.6 | 83.5 | 88.3 | 79.1 | 87.5 | - |
| AOTL | PRE_YTB_DAV | 16.0 | | 84.0 | 82.8 | 87.6 | 78.6 | 87.1 | gdrive |
| AOTL | PRE_YTB_DAV | 6.5 | √ | 84.2 | 83.0 | 87.8 | 78.7 | 87.3 | gdrive |
| DeAOTL | PRE_YTB_DAV | 24.7 | | 84.7 | 83.8 | 88.8 | 79.0 | 87.2 | - |
| R50-AOTL | PRE_YTB_DAV | 14.9 | | 84.4 | 83.4 | 88.1 | 78.7 | 87.2 | gdrive |
| R50-AOTL | PRE_YTB_DAV | 6.4 | √ | 85.3 | 83.9 | 88.8 | 79.9 | 88.5 | gdrive |
| R50-DeAOTL | PRE_YTB_DAV | 22.4 | | 85.9 | 84.6 | 89.4 | 80.8 | 88.9 | - |
| SwinB-AOTL | PRE_YTB_DAV | 9.3 | | 84.7 | 84.0 | 88.8 | 78.7 | 87.1 | gdrive |
| SwinB-AOTL | PRE_YTB_DAV | 5.2 | √ | 85.3 | 84.6 | 89.5 | 79.3 | 87.7 | gdrive |
| SwinB-DeAOTL | PRE_YTB_DAV | 11.9 | | 86.1 | 85.3 | 90.2 | 80.4 | 88.6 | - |

DAVIS-2017 test

| Model | Stage | FPS | Mean | J Score | F Score | Predictions |
| --- | --- | --- | --- | --- | --- | --- |
| AOTT | PRE_YTB_DAV | 51.4 | 73.7 | 70.0 | 77.3 | gdrive |
| AOTS | PRE_YTB_DAV | 40.0 | 75.2 | 71.4 | 78.9 | gdrive |
| AOTB | PRE_YTB_DAV | 29.6 | 77.4 | 73.7 | 81.1 | gdrive |
| AOTL | PRE_YTB_DAV | 18.7 | 79.3 | 75.5 | 83.2 | gdrive |
| R50-AOTL | PRE_YTB_DAV | 18.0 | 79.5 | 76.0 | 83.0 | gdrive |
| SwinB-AOTL | PRE_YTB_DAV | 12.1 | 82.1 | 78.2 | 85.9 | gdrive |

DAVIS-2017 val

| Model | Stage | FPS | Mean | J Score | F Score | Predictions |
| --- | --- | --- | --- | --- | --- | --- |
| AOTT | PRE_YTB_DAV | 51.4 | 79.2 | 76.5 | 81.9 | gdrive |
| AOTS | PRE_YTB_DAV | 40.0 | 82.1 | 79.3 | 84.8 | gdrive |
| AOTB | PRE_YTB_DAV | 29.6 | 83.3 | 80.6 | 85.9 | gdrive |
| AOTL | PRE_YTB_DAV | 18.7 | 83.6 | 80.8 | 86.3 | gdrive |
| R50-AOTL | PRE_YTB_DAV | 18.0 | 85.2 | 82.5 | 87.9 | gdrive |
| SwinB-AOTL | PRE_YTB_DAV | 12.1 | 85.9 | 82.9 | 88.9 | gdrive |

DAVIS-2016 val

| Model | Stage | FPS | Mean | J Score | F Score | Predictions |
| --- | --- | --- | --- | --- | --- | --- |
| AOTT | PRE_YTB_DAV | 51.4 | 87.5 | 86.5 | 88.4 | gdrive |
| AOTS | PRE_YTB_DAV | 40.0 | 89.6 | 88.6 | 90.5 | gdrive |
| AOTB | PRE_YTB_DAV | 29.6 | 90.9 | 89.6 | 92.1 | gdrive |
| AOTL | PRE_YTB_DAV | 18.7 | 91.1 | 89.5 | 92.7 | gdrive |
| R50-AOTL | PRE_YTB_DAV | 18.0 | 91.7 | 90.4 | 93.0 | gdrive |
| SwinB-AOTL | PRE_YTB_DAV | 12.1 | 92.2 | 90.6 | 93.8 | gdrive |