ConvNeXt V2 fine-tuned for camera level classification

ConvNeXt V2 base-size model fine-tuned to classify camera level. The model was fine-tuned on the Cinescale dataset for 20 epochs.

Classifies an image into one of six classes: aerial, eye, ground, hip, knee, shoulder.

Evaluation

On the test set (test.csv), the model achieves an accuracy of 89.82% and a macro-F1 of 82.31%.
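
Macro-F1 averages the per-class F1 scores with equal weight, so it can sit below accuracy when some classes are predicted less reliably than others. The following is a minimal sketch of how both metrics can be computed in plain Python. The function name `accuracy_and_macro_f1` is hypothetical and the labels are made-up toy data, not the actual test.csv predictions; the evaluation script used for the reported numbers is not part of this card.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1) for two parallel lists of class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true class but was missed

    # Average over every class that appears in either list.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_f1


# Toy data (assumption, for illustration only): 8 "eye" images, 2 "knee" images,
# with one "knee" image misclassified as "eye".
y_true = ["eye"] * 8 + ["knee"] * 2
y_pred = ["eye"] * 8 + ["eye", "knee"]

acc, f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro-F1={f1:.4f}")
```

In this toy case a single error on the rare class costs little accuracy but pulls macro-F1 down noticeably, which is the kind of gap the two reported figures show.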

How to use

from transformers import AutoModelForImageClassification
import torch
from torchvision.transforms import v2
from torchvision.io import read_image, ImageReadMode

model = AutoModelForImageClassification.from_pretrained("gullalc/convnextv2-base-22k-384-cinescale-level")
im_size = 384

# https://www.pexels.com/photo/aerial-view-of-city-buildings-8783146/
image = read_image("demo/level_demo.jpg", mode=ImageReadMode.RGB)

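# Resize to the model's input resolution, scale pixel values to [0, 1],
# then normalize with the ImageNet mean and std
transform = v2.Compose([
    v2.Resize((im_size, im_size), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Add a batch dimension: (3, H, W) -> (1, 3, H, W)
inputs = transform(image).unsqueeze(0)

with torch.no_grad():
    outputs = model(pixel_values=inputs)

# Map the highest-scoring class index back to its label
predicted_label = model.config.id2label[torch.argmax(outputs.logits, dim=-1).item()]
print(predicted_label)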
# --> aerial
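
The snippet above reports only the top class. To see how confident the model is across the classes, the logits can be turned into probabilities with a softmax and ranked. Below is a minimal sketch in plain Python; the helper name `top_k_from_logits`, the example logits, and the index-to-label order are all assumptions for illustration. In practice the mapping would come from `model.config.id2label` and the logits from `outputs.logits[0].tolist()`.

```python
import math


def top_k_from_logits(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs for one image."""
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by descending probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical mapping and logits (assumptions, not taken from the real model).
example_id2label = {0: "aerial", 1: "eye", 2: "ground", 3: "hip", 4: "knee", 5: "shoulder"}
example_logits = [4.0, 1.0, 0.5, 0.0, -1.0, 2.0]

top3 = top_k_from_logits(example_logits, example_id2label, k=3)
for label, prob in top3:
    print(f"{label}: {prob:.3f}")
```

With the real model, the call would look like `top_k_from_logits(outputs.logits[0].tolist(), model.config.id2label)`.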
Model size: 87.7M params (tensor type F32, Safetensors)