Another great week in open ML! Here's a small recap.
Model releases
Video language models: AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2.
Small language models: Hugging Face released HuggingFaceTB/SmolLM2-1.7B, part of a new family of smol language models under the Apache 2.0 license that comes in sizes 135M, 360M and 1.7B, along with datasets. Meta released facebook/MobileLLM-1B, part of a new family of on-device LLMs of sizes 125M, 350M and 600M.
Any-to-any: gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released! It is a new LLM that takes image, text and audio input and outputs speech.
Dataset releases
Spawning/PD12M is a new captioning dataset of 12.4 million examples generated using Florence-2.
Hello, researchers! I've tried to make reading HF Daily Papers easier and built a tool that reviews them with LLMs like Claude 3.5, GPT-4o and sometimes FLUX.
- Classification by topics
- Sorting by publication date and HF addition date
- Syncing every 2 hours
- Hosted on GitHub
- English, Russian, and Chinese
- Top by week/month (in progress)
Microsoft released a groundbreaking model that can be used for web automation, with an MIT license: microsoft/OmniParser
An interesting highlight for me was the model's capabilities on Mind2Web (a benchmark for web navigation), which unlock agentic behavior for RPA agents.
No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see whether it can be used as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO.
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing.
Lotus is a new foundation model for monocular depth estimation. Compared to previous diffusion-based MDE models, Lotus is modified for dense prediction tasks. The authors also released a model for normal prediction. Find everything in this collection: merve/lotus-6718fb957dc1c85a47ca1210
It's raining depth estimation models! DepthPro is a zero-shot depth estimation model by Apple; it's fast, sharp and accurate.
Demo: akhaliq/depth-pro
Model: apple/DepthPro
Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073)
The model consists of two encoders: an encoder for patches and an image encoder. The outputs of both are merged and decoded into depth maps, and are also used to estimate the focal length. The model outperforms the previous state-of-the-art models on average across various benchmarks.
This is not a drill! HuggingChat is now multimodal with meta-llama/Llama-3.2-11B-Vision-Instruct! This also comes with multimodal assistants: I have migrated my Marcus Aurelius advice assistant to Llama-Vision, and Marcus can see now!
Meta AI vision has been cooking @facebook. They shipped multiple models and demos for their papers at @ECCV.
Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more. All models have open weights, demos and even TorchScript checkpoints! A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art consistent 3D generation model from images
If you feel like you missed out on ECCV 2024, there's an app to browse the papers, rank them by popularity, and filter for open models, datasets and demos.
ColPali is revolutionizing multimodal retrieval, but could it be even more effective with domain-specific fine-tuning?
Check out my latest blog post, where I guide you through creating a ColPali fine-tuning dataset using Qwen/Qwen2-VL-7B-Instruct to generate queries for a collection of UFO documents sourced from the Internet Archive.
The post covers:
- Introduction to data for ColPali models
- Using Qwen2-VL for retrieval query generation
- Tips for better query generation
If you have documents that contain more than just text and you're doing retrieval or RAG with OCR and LLMs, give that up and give ColPali and vision language models a try.
Why? Documents consist of multiple modalities: layout, tables, text, charts, images. Document processing pipelines often consist of multiple models, and they're immensely brittle and slow.
How? ColPali is a ColBERT-like document retrieval model built on PaliGemma. It operates over image patches directly, and indexing takes far less time, with better accuracy. You can use it for retrieval on its own. If you want to do retrieval augmented generation, find the closest document and, without processing it any further, give it directly to a VLM like Qwen2-VL (as image input) along with your text query.
This is much faster, you do not lose out on any information, and it's much easier to maintain too!
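To make the "ColBERT-like" part a bit more concrete, here is a minimal sketch of late-interaction (MaxSim) scoring, the scoring rule ColBERT is known for, in plain Python. It is only an illustration under assumptions: the tiny 2-D vectors stand in for real query-token and image-patch embeddings, and the names `maxsim_score` and `rank_pages` are hypothetical helpers, not part of any ColPali library.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Late-interaction score: for every query token embedding, keep only its
    best-matching patch embedding on the page, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): two query-token embeddings and two pages,
# each page represented by two patch embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.2, 0.1]],  # page 0: matches only the first query token well
    [[0.8, 0.0], [0.1, 0.9]],  # page 1: matches both query tokens well
]

ranking = rank_pages(query, pages)
best_page = ranking[0]  # index of the page image you would hand to the VLM
```

In a real setup the embeddings would come from the retrieval model rather than being written by hand, and the top-ranked page image would be passed to the VLM together with the text query, as described above.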