SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 8 days ago • 118
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Paper • 2502.07617 • Published 17 days ago • 28
view article Article From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub 17 days ago • 49
view article Article From Llasa to Llasagna 🍕: Finetuning LLaSA to generates Italian speech and other languages By Steveeeeeeen and 1 other • 17 days ago • 25
DepthPro Models Collection Depth Pro: Sharp Monocular Metric Depth in Less Than a Second • 4 items • Updated 21 days ago • 7
view article Article Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO) By ariG23498 • Jan 19 • 14
ViTPose Collection Collection for ViTPose models based on transformers implementation. • 10 items • Updated Jan 12 • 13
Segformer Collection Transformer-based semantic segmentation model by Nvidia • 15 items • Updated Jan 13 • 4
timm tiny test models Collection A collection of very small (~300-500k parameter) models at 160x160 resolution, for testing purposes. Trained on ImageNet-1k. • 13 items • Updated Oct 2, 2024 • 5
view article Article ColPali: Efficient Document Retrieval with Vision Language Models 👀 By manu • Jul 5, 2024 • 208