
Collections


Collections including paper arxiv:2412.09596

Video Understanding

Vript: A Video Is Worth Thousands of Words

Paper • 2406.06040 • Published Jun 10, 2024 • 25
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6, 2024 • 73
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Paper • 2406.01574 • Published Jun 3, 2024 • 44
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Paper • 2405.21075 • Published May 31, 2024 • 21

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Paper • 2311.17049 • Published Nov 28, 2023 • 1
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Paper • 2405.04434 • Published May 7, 2024 • 14
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

Paper • 2303.17376 • Published Mar 30, 2023
Sigmoid Loss for Language Image Pre-Training

Paper • 2303.15343 • Published Mar 27, 2023 • 6


FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Paper • 2403.10516 • Published Mar 15, 2024 • 16
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 92

OneLLM: One Framework to Align All Modalities with Language

Paper • 2312.03700 • Published Dec 6, 2023 • 20
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Paper • 2402.03162 • Published Feb 5, 2024 • 17
Rolling Diffusion Models

Paper • 2402.09470 • Published Feb 12, 2024 • 10
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Paper • 2402.12226 • Published Feb 19, 2024 • 41

