Merve Noyan PRO

merve

AI & ML interests

VLMs, vision & co

Articles

Organizations

merve's activity

posted an update 9 days ago
view post
Post
4813
Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video LM model based on DINOv2, SigLIP, Qwen2 and Llama 3.2

πŸ’¬ Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M

πŸ–ΌοΈ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license

πŸ–ΌοΈπŸ’¬Any-to-Any
gpt-omni/mini-omni2 is closest reproduction to GPT-4o, a new LLM that can take image-text-audio input and output speech is released!

Dataset releases
πŸ–ΌοΈ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2
reacted to averoo's post with πŸ”₯πŸ‘€ 12 days ago
view post
Post
3674
Hello, researchers! I've tried to made reading HF Daily Papers easier and made a tool that does reviews with LLMs like Claude 3.5, GPT-4o and sometimes FLUX.

πŸ“š Classification by topics
πŸ“… Sorting by publication date and HF addition date
πŸ”„ Syncing every 2 hours
πŸ’» Hosted on GitHub
🌏 English, Russian, and Chinese
πŸ“ˆ Top by week/month (in progress)

πŸ‘‰ https://hfday.ru

Let me know what do you think of it.
posted an update 13 days ago
view post
Post
4936
Hugging Face Hub Python library now comes with easy inference for vision language models! ✨

$ pip install huggingface_hub πŸ€—
  • 1 reply
Β·
posted an update 16 days ago
view post
Post
3402
Microsoft released a groundbreaking model that can be used for web automation, with MIT license πŸ”₯ microsoft/OmniParser

Interesting highlight for me was Mind2Web (a benchmark for web navigation) capabilities of the model, which unlocks agentic behavior for RPA agents.

no need for hefty web automation pipelines that get broken when the website/app design changes! Amazing work.

Lastly, the authors also fine-tune this model on open-set detection for interactable regions and see if they can use it as a plug-in for VLMs and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. πŸ‘


OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT4V in parsing.
posted an update 18 days ago
view post
Post
2381
Lotus πŸͺ· is a new foundation model on monocular depth estimation ✨
Compared to previous diffusion-based MDE models, Lotus is modified for dense prediction tasks
Authors also released a model for normal prediction πŸ€—
Find everything in this collection merve/lotus-6718fb957dc1c85a47ca1210
posted an update 19 days ago
posted an update 20 days ago
posted an update 23 days ago
view post
Post
1927
It's raining depth estimation models β˜”οΈ
DepthPro is a zero-shot depth estimation model by Apple, it's fast, sharp and accurate πŸ”₯
Demo: akhaliq/depth-pro
Model: apple/DepthPro
Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073)

The model consists of two encoders: an encoder for patches and an image encoder πŸ–ΌοΈ The outputs of both are merged to decode to depth maps and get the focal length.
The model outperforms the previous state-of-the-art models in average of various benchmarks πŸ“‘
posted an update 24 days ago
posted an update 30 days ago
view post
Post
2819
This is not a drill πŸ’₯
HuggingChat is now multimodal with meta-llama/Llama-3.2-11B-Vision-Instruct! πŸ€—
This also comes with multimodal assistants, I have migrated my Marcus Aurelius advice assistant to Llama-Vision and Marcus can see now! πŸ˜„

Chat with Marcus: https://hf.co/chat/assistant/65bfed22022ba290531112f8
Start chatting with Llama-Vision 3.2 11B Instruct https://huggingface.co./chat/models/meta-llama/Llama-3.2-11B-Vision-Instruct
  • 1 reply
Β·
posted an update about 1 month ago
view post
Post
3700
Meta AI vision has been cooking @facebook
They shipped multiple models and demos for their papers at @ECCV πŸ€—

Here's a compilation of my top picks:
- Sapiens is family of foundation models for human-centric depth estimation, segmentation and more, all models have open weights and demos πŸ‘

All models have their demos and even torchscript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is state-of-the-art consistent 3D generation model from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker
posted an update about 1 month ago
view post
Post
3955
If you feel like you missed out for ECCV 2024, there's an app to browse the papers, rank for popularity, filter for open models, datasets and demos πŸ“

Get started at ECCV/ECCV2024-papers ✨
posted an update about 1 month ago
view post
Post
2689
NVIDIA just dropped a gigantic multimodal model called NVLM 72B πŸ¦–
nvidia/NVLM-D-72B
Paper page NVLM: Open Frontier-Class Multimodal LLMs (2409.11402)

The paper contains many ablation studies on various ways to use the LLM backbone πŸ‘‡πŸ»

🦩 Flamingo-like cross-attention (NVLM-X)
πŸŒ‹ Llava-like concatenation of image and text embeddings to a decoder-only model (NVLM-D)
✨ a hybrid architecture (NVLM-H)

Checking evaluations, NVLM-D and NVLM-H are best or second best compared to other models πŸ‘

The released model is NVLM-D based on Qwen-2 Instruct, aligned with InternViT-6B using a huge mixture of different datasets

You can easily use this model by loading it through transformers' AutoModel 😍
posted an update about 1 month ago
view post
Post
2742
We've shipped new computer vision/multimodal tasks to Hugging Face Hub 🫑
Keypoint detection just landed with many docs, and goodies 🎁
https://huggingface.co./models?pipeline_tag=keypoint-detection

In Hugging Face transformers we have SuperPoint, foundation model for keypoint detection, check out the demo here merve/SuperPoint

Shipped transformers task guide on keypoint detection https://huggingface.co./docs/transformers/tasks/keypoint_detection πŸ“–

Also shipped the task page https://huggingface.co./tasks/keypoint-detection (easiest way to get started!) πŸ”–
reacted to davanstrien's post with πŸ§ πŸš€πŸ‘€ about 2 months ago
view post
Post
3118
ColPali is revolutionizing multimodal retrieval, but could it be even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I guide you through creating a ColPali fine-tuning dataset using Qwen/Qwen2-VL-7B-Instruct to generate queries for a collection of UFO documents sourced from the Internet Archive.

The post covers:
- Introduction to data for ColPali models
- Using Qwen2-VL for retrieval query generation
- Tips for better query generation

Check out the post here:
https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

The resulting Hugging Face dataset: davanstrien/ufo-ColPali
  • 1 reply
Β·
posted an update 2 months ago
view post
Post
5475
I have put together a notebook on Multimodal RAG, where we do not process the documents with hefty pipelines but natively use:
- vidore/colpali for retrieval πŸ“– it doesn't need indexing with image-text pairs but just images!
- Qwen/Qwen2-VL-2B-Instruct for generation πŸ’¬ directly feed images as is to a vision language model with no processing to text!
I used ColPali implementation of the new 🐭 Byaldi library by @bclavie πŸ€—
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
posted an update 2 months ago
view post
Post
3777
If you have documents that do not only have text and you're doing retrieval or RAG (using OCR and LLMs), give it up and give ColPali and vision language models a try πŸ€—

Why? Documents consist of multiple modalities: layout, table, text, chart, images. Document processing pipelines often consist of multiple models and they're immensely brittle and slow. πŸ₯²

How? ColPali is a ColBERT-like document retrieval model built on PaliGemma, it operates over image patches directly, and indexing takes far less time with more accuracy. You can use it for retrieval, and if you want to do retrieval augmented generation, find the closest document, and do not process it, give it directly to a VLM like Qwen2-VL (as image input) and give your text query. 🀝

This is much faster + you do not lose out on any information + much easier to maintain too! πŸ₯³

Multimodal RAG merve/multimodal-rag-66d97602e781122aae0a5139 πŸ’¬
Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e πŸ“–
  • 2 replies
Β·