HuggingFaceTB/SmolVLM2-500M-Video-Instruct

mfarre (HF staff) committed 24 days ago

Commit f730b73 · verified · 1 Parent(s): a3992d5

Update README.md

Files changed (1)

README.md +11 -3

@@ -4,6 +4,16 @@ license: apache-2.0
 datasets:
 - HuggingFaceM4/the_cauldron
 - HuggingFaceM4/Docmatix
+- lmms-lab/LLaVA-OneVision-Data
+- lmms-lab/M4-Instruct-Data
+- HuggingFaceFV/finevideo
+- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
+- lmms-lab/LLaVA-Video-178K
+- orrzohar/Video-STaR
+- Mutonix/Vript
+- TIGER-Lab/VISTA-400K
+- Enxin/MovieChat-1K_train
+- ShareGPT4Video/ShareGPT4Video
 pipeline_tag: video-text-to-text
 language:
 - en
@@ -17,9 +27,7 @@ base_model:
 # SmolVLM2-500M-Video
-SmolVLM2-500M-Video is a model optimized for video that accepts video, arbitrary sequences of image and text inputs to produce text outputs. It can answer questions about media files, compare images,  describe visual content, or transcribe text.
-Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks. It can run inference on a video with 1.8GB of GPU RAM.
+SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
 ## Model Summary
 - **Developed by:** Hugging Face 🤗