VideoMAE finetuned for shot scale classification

A videomae-base-finetuned-kinetics model finetuned to classify shot scale into five classes: ECS (extreme close-up shot), CS (close-up shot), MS (medium shot), FS (full shot), and LS (long shot).

The model was finetuned on the MovieNet dataset for 5 epochs. The training, validation, and test splits come from v1_split_trailer.json.

Evaluation

The model achieves an accuracy of 88.93% and a macro-F1 of 89.19%.

Class-wise accuracies: ECS - 91.16%, CS - 83.65%, MS - 86.2%, FS - 90.74%, LS - 94.55%
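For readers who want to reproduce metrics of this kind on their own predictions, here is a minimal sketch of how overall accuracy, class-wise accuracy, and macro-F1 can be computed. It assumes that "class-wise accuracy" means the fraction of clips of each true class that were predicted correctly (per-class recall); the model card does not state this explicitly. The function name `classification_metrics` and the toy labels are illustrative only and are not taken from the original evaluation code.

```python
LABELS = ["ECS", "CS", "MS", "FS", "LS"]


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return (accuracy, per-class accuracy dict, macro-F1) for label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class_acc = {}
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)

        # Assumption: class-wise accuracy = correct / number of true samples.
        support = tp + fn
        per_class_acc[label] = tp / support if support else 0.0

        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, per_class_acc, macro_f1


# Toy example (made-up labels, not MovieNet results).
y_true = ["ECS", "CS", "MS", "FS", "LS", "CS"]
y_pred = ["ECS", "MS", "MS", "FS", "LS", "CS"]
accuracy, per_class_acc, macro_f1 = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f}, macro-F1={macro_f1:.4f}")
print(per_class_acc)
```

Because macro-F1 weights every class equally, it can differ from overall accuracy when the classes are imbalanced or when errors concentrate in one class.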

How to use

The code below shows how to test the model on a single shot/clip from a video. The same code was used to process, transform, and evaluate the MovieNet test set.

import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from pytorchvideo.transforms import ApplyTransformToKey
from torchvision.transforms import v2
from decord import VideoReader, cpu

## Preprocessor and Model loading
image_processor = VideoMAEImageProcessor.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
model = VideoMAEForVideoClassification.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")

## Normalization statistics and target size come from the preprocessor config
img_mean = image_processor.image_mean
img_std = image_processor.image_std
height = width = image_processor.size["shortest_edge"]
resize_to = (height, width)

## Evaluation Transform (defined after the values above, which it depends on)
transform = v2.Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=v2.Compose(
                [
                    v2.Lambda(lambda x: x.permute(0, 3, 1, 2)), # T, H, W, C -> T, C, H, W
                    v2.UniformTemporalSubsample(16),             # keep 16 evenly spaced frames
                    v2.Resize(resize_to),
                    v2.Lambda(lambda x: x / 255.0),              # uint8 [0, 255] -> float [0, 1]
                    v2.Normalize(img_mean, img_std)
                ]
            ),
        ),
    ]
)

## Load video/clip and predict
video_path = "random_clip.mp4"
vr = VideoReader(video_path, width=480, height=270, ctx=cpu(0))
frames_tensor = torch.stack([torch.tensor(vr[i].asnumpy()) for i in range(len(vr))])  ## Shape: (T, H, W, C)

frames_tensor = transform({"video": frames_tensor})["video"]  ## Shape: (16, C, H, W)
pixel_values = frames_tensor.unsqueeze(0)  ## Add batch dimension: (1, 16, C, H, W)

with torch.no_grad():  ## Inference only, no gradients needed
    outputs = model(pixel_values=pixel_values)
pred = torch.argmax(outputs.logits, dim=1).cpu().numpy()

print(model.config.id2label[int(pred[0])])
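The snippet above decodes every frame of the clip before subsampling to 16 frames, which may be slow or memory-hungry for long shots. One possible alternative, sketched below, is to pick 16 evenly spaced frame indices first and decode only those. The helper `uniform_frame_indices` is hypothetical and not part of the original evaluation code; it uses integer arithmetic, so the selected indices may differ by one frame from those chosen by `v2.UniformTemporalSubsample` in some cases.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Return num_samples evenly spaced frame indices in [0, num_frames - 1].

    Indices repeat when the clip has fewer frames than num_samples.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    # Floor of i * last / (num_samples - 1), computed without floating point.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


# Example: a 31-frame clip yields every second frame.
indices = uniform_frame_indices(31)
print(indices)
```

With decord, the resulting list could be passed to `vr.get_batch(indices)` so that only the selected frames are decoded; since the clip then already has 16 frames, the `UniformTemporalSubsample(16)` step in the transform should leave it unchanged.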