arXiv:2408.00759

Text-Guided Video Masked Autoencoder

Published on Aug 1, 2024

Abstract

Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues, such as motion, to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match the underlying assumptions. On the other hand, a natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and it has not yet been explored for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked reconstruction, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that, across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks, especially under linear probing. Within this unified framework, our TGM achieves the best relative performance on five action recognition datasets and one egocentric dataset, highlighting the complementary nature of natural language for masked video modeling.
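
To make the masking idea concrete, below is a minimal sketch of text-guided masking as the abstract describes it: score each spatiotemporal patch by its similarity to the paired caption and mask the highest-scoring patches. It assumes CLIP-style per-patch video embeddings and a pooled caption embedding in a shared space; the function name `text_guided_mask` and the parameter `mask_ratio` are hypothetical, and the paper's actual implementation may differ.

```python
import torch
import torch.nn.functional as F

def text_guided_mask(patch_tokens: torch.Tensor,
                     caption_embed: torch.Tensor,
                     mask_ratio: float = 0.75) -> torch.Tensor:
    """Sketch of TGM-style masking.

    patch_tokens:  (B, N, D) per-patch video embeddings
    caption_embed: (B, D) pooled text embedding of the paired caption
    Returns a boolean mask of shape (B, N); True = masked for reconstruction.
    """
    # Cosine similarity between each patch and the caption.
    v = F.normalize(patch_tokens, dim=-1)
    t = F.normalize(caption_embed, dim=-1)
    sim = torch.einsum("bnd,bd->bn", v, t)            # (B, N)

    # Mask the patches with the HIGHEST text correspondence, i.e. the
    # regions the caption implicitly marks as salient.
    num_mask = int(sim.shape[1] * mask_ratio)
    idx = sim.topk(num_mask, dim=1).indices           # (B, num_mask)
    mask = torch.zeros_like(sim, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

# Example usage with toy tensors (e.g. 8x14x14 = 1568 patches from a ViT):
tokens = torch.randn(2, 1568, 768)
caption = torch.randn(2, 768)
mask = text_guided_mask(tokens, caption)              # (2, 1568), 75% True
```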
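
The unified framework combines two objectives: pixel reconstruction on the masked patches (the MAE term) and a video-text contrastive term computed from the masked video. The sketch below assumes a symmetric InfoNCE contrastive loss and a weighting coefficient `lambda_c`; both choices are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_mae_contrastive_loss(recon: torch.Tensor,
                               target: torch.Tensor,
                               video_embed: torch.Tensor,
                               text_embed: torch.Tensor,
                               lambda_c: float = 1.0,
                               temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a joint MAE + masked video-text contrastive objective.

    recon, target: (B, M, P) predicted vs. ground-truth masked patch pixels
    video_embed:   (B, D) pooled embedding of the masked video
    text_embed:    (B, D) caption embedding
    """
    # MAE term: mean-squared error over the masked patches.
    loss_mae = F.mse_loss(recon, target)

    # Contrastive term (assumed symmetric InfoNCE): match each masked video
    # to its own caption against the rest of the batch.
    v = F.normalize(video_embed, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    logits = v @ t.T / temperature                    # (B, B)
    labels = torch.arange(v.shape[0], device=v.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

    return loss_mae + lambda_c * loss_con
```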
