Llava Hugging Face


llava-hf's activity

Xenova posted an update 6 days ago
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web
merve posted an update 7 days ago
Aya by Cohere For AI can now see! 👀

The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 It works in 8 languages! 🗣️

The authors extend the LLaVA dataset using Aya's translation capabilities, yielding 558k examples!
Try it here: kkr5155/maya_demo

Dataset: maya-multimodal/pretrain

Model: maya-multimodal/maya 👏
kudos to @nahidalam and team
merve posted an update 8 days ago
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co./Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co./Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co./Apollo-LMMs/Apollo-7B-t32 with an Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co./spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes video LMs work ⏯️

Try the demo for the best setup here https://huggingface.co./spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions validated on small models also scale properly when the model and dataset are scaled up 📈, while scaling the dataset has diminishing returns for smaller models
> They evaluate frame-sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal (see the sampling sketch after this post)
> They also compare image encoders, from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models 🔥
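
To make the FPS-sampling point concrete, here is a small illustrative sketch (not Apollo's actual code) comparing uniform sampling with fixed-FPS sampling of frame indices; the function names and the 2-fps / 128-frame numbers are assumptions for the example.

```python
# Illustrative comparison of two frame-sampling strategies for video LMs:
# uniform sampling picks a fixed number of evenly spaced frames regardless of duration,
# while FPS sampling picks frames at a fixed temporal rate (e.g. 2 frames per second).
import numpy as np

def uniform_sample(num_frames, num_samples):
    """Pick `num_samples` evenly spaced frame indices from a video with `num_frames` frames."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def fps_sample(num_frames, video_fps, target_fps, max_frames=None):
    """Pick frame indices at `target_fps`; optionally cap the total with `max_frames`."""
    step = video_fps / target_fps                               # frames between two samples
    indices = np.floor(np.arange(0, num_frames, step)).astype(int)
    if max_frames is not None and len(indices) > max_frames:
        indices = uniform_sample(num_frames, max_frames)        # cap very long videos
    return indices

# A 60 s clip at 30 fps (1800 frames): uniform sampling of 32 frames spaces them ~2 s apart,
# while 2-fps sampling keeps a constant 0.5 s spacing (120 frames before any cap).
print(uniform_sample(num_frames=1800, num_samples=32))
print(fps_sample(num_frames=1800, video_fps=30.0, target_fps=2.0, max_frames=128))
```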
merve posted an update 13 days ago
A complete RAG pipeline includes a reranker, which ranks the retrieved documents to find the best one 📓
The same goes for multimodal RAG: multimodal rerankers can be integrated into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as the retriever, lightonai/MonoQwen2-VL-v0.1 as the reranker and Qwen/Qwen2-VL-7B-Instruct as the VLM in this notebook that runs on a GPU as small as an L4 🔥 https://huggingface.co./learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
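
A minimal sketch of the retrieve-then-generate part of such a pipeline, assuming the colpali-engine and transformers packages, a CUDA GPU, and hypothetical page images and query; the MonoQwen2-VL reranking step from the notebook is only indicated as a comment here.

```python
# Rough sketch of a multimodal RAG pipeline: ColQwen2 retrieval + Qwen2-VL generation.
# `pages` and `query` below are hypothetical placeholders, not the notebook's data.
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

pages = [Image.open("page_1.png"), Image.open("page_2.png")]  # hypothetical document page images
query = "What is the projected revenue for 2025?"             # hypothetical question

# 1) Retrieval: embed pages and query with ColQwen2, score with late interaction
retriever = ColQwen2.from_pretrained("vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16, device_map="cuda").eval()
retriever_processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

with torch.no_grad():
    image_embeddings = retriever(**retriever_processor.process_images(pages).to(retriever.device))
    query_embeddings = retriever(**retriever_processor.process_queries([query]).to(retriever.device))

scores = retriever_processor.score_multi_vector(query_embeddings, image_embeddings)  # (1, num_pages)
best_page = pages[scores.argmax().item()]

# 2) (Reranking) The notebook additionally rescores the top-k retrieved pages with
#    lightonai/MonoQwen2-VL-v0.1 before choosing the page that is passed to the VLM.

# 3) Generation: answer the query with Qwen2-VL conditioned on the selected page
vlm = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": query}]}]
prompt = vlm_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = vlm_processor(text=[prompt], images=[best_page], return_tensors="pt").to(vlm.device)
output_ids = vlm.generate(**inputs, max_new_tokens=128)
print(vlm_processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```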
Xenova posted an update 16 days ago
Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)
merve posted an update 17 days ago
This week in open-source AI was insane 🤠 A small recap 🕺🏻 merve/dec-6-releases-67545caebe9fc4776faac0a3

Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes (3B, 10B and 28B) and with pre-trained and captioning variants 👍
> OpenGVLab released InternVL 2.5, seven new vision LMs in different sizes, with a SOTA checkpoint under an MIT license ✨
> The Qwen team at Alibaba released the base models of Qwen2VL in 2B, 7B and 72B checkpoints

LLMs 💬
> Meta released Llama 3.3-70B, a new iteration of its 70B model trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with an Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with an Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!

Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux

Audio 🔊
> Indic-Parler-TTS is a new text-to-speech model made by the community
merve posted an update 18 days ago
New InternVL drop with a state-of-the-art 78B vision language model with an MIT license 🔥 https://huggingface.co./collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B paired with Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (1.8B, 7B, 20B)
The 78B model combines InternViT 6B with Qwen2.5-72B Instruct and can accomplish a variety of tasks 👍 Try it here: OpenGVLab/InternVL
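
A minimal loading sketch, assuming the trust_remote_code chat interface documented on the InternVL model cards and using OpenGVLab/InternVL2_5-8B as an example checkpoint id; shown as a text-only turn to keep the custom image preprocessing out of the sketch.

```python
# Rough sketch: chatting with an InternVL 2.5 checkpoint via its remote-code interface.
# Assumptions: the OpenGVLab/InternVL2_5-8B repo id and the `.chat()` helper from the model
# card; passing pixel_values=None runs a pure-text turn, so no image preprocessing is needed.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-8B"  # assumed checkpoint id; swap for another size if needed
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=256, do_sample=False)
question = "Summarize what vision language models can do in two sentences."
# pixel_values=None -> text-only turn; return_history=True also returns the running chat history
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(response)
```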
merve posted an update 23 days ago
small but mighty 🔥
You can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB of VRAM 🫰🏻 with gradient accumulation, the simulated batch size goes up to 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation and gradient checkpointing, with explanations of how they work: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
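
A minimal configuration sketch of those three tricks using transformers, peft and bitsandbytes; the checkpoint id, LoRA targets and hyperparameters below are illustrative assumptions, not the notebook's exact settings.

```python
# Rough sketch of the training setup described above: 4-bit QLoRA, gradient accumulation,
# and gradient checkpointing. Values below are illustrative, not the notebook's exact config.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint id

# QLoRA part 1: load the base model quantized to 4 bits
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# QLoRA part 2: freeze the 4-bit weights and train only small low-rank adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative target modules
)
model = get_peft_model(model, lora_config)

# Gradient accumulation (4 x 4 = simulated batch size 16) and gradient checkpointing
training_args = TrainingArguments(
    output_dir="smolvlm-ft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
    remove_unused_columns=False,  # keep image columns for the vision-text collator
)
# training_args would then be passed to a Trainer together with a dataset and a vision-text collator.
```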
merve posted an update 23 days ago
Last week we were blessed with open-source models! A recap
merve/nov-29-releases-674ccc255a57baf97b1e2d31

🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B, a new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI released the base models of Aria, Aria-Base-64K and Aria-Base-8K, with their respective context lengths
> The ViDoRe team released ColSmolVLM, a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct is a new dataset labelled using the Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, was released by the creators of Llava-CoT 📕

💬 LLMs
> The Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet 🔥
> Alibaba released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, new state-of-the-art SLMs with a hybrid architecture (Mamba + transformer)

⏯️ Image/Video Generation
> Qwen2VL-Flux is a new image generation model based on the Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 resolution ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with the DIBT community effort by Argilla 🏷️

Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B and trained on 5B audio prompt tokens