---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co./Qwen/Qwen2-VL-72B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2-VL-72B
---

This preview model was trained for 1 epoch with LoRA.

A fully trained checkpoint is also available: https://huggingface.co./osunlp/UGround-V1-72B (slightly better on ScreenSpot-Pro and ScreenSpot).

# Qwen2-VL-72B-Instruct

## Introduction

We're excited to unveil **Qwen2-VL**, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

### What’s New in Qwen2-VL?

#### Key Enhancements:

* **SoTA understanding of images of various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
* **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
* **Agent that can operate your mobiles, robots, etc.**: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on the visual environment and text instructions.
* **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

#### Model Architecture Updates:

* **Naive Dynamic Resolution**: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
* **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
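
The M-ROPE decomposition can be pictured with a small sketch. The helper below, `mrope_position_ids`, is hypothetical and is not the model's actual implementation. It assumes that text tokens receive the same index on all three axes, that the patches of one image share a single temporal index while their height and width indices follow the patch grid, and that the next segment resumes after the largest index used so far. Treat these rules as illustrative assumptions and check them against the Qwen2-VL code before relying on them.

```python
def mrope_position_ids(segments):
    """Assign (temporal, height, width) position triples to a token sequence.

    `segments` is a list of ("text", n_tokens) or ("image", (grid_h, grid_w)).
    Illustrative only: the rules here are assumptions, not the official code.
    """
    positions = []
    next_pos = 0  # first free index on every axis
    for kind, spec in segments:
        if kind == "text":
            # Text is 1D: all three components advance together.
            for i in range(spec):
                p = next_pos + i
                positions.append((p, p, p))
            next_pos += spec
        elif kind == "image":
            grid_h, grid_w = spec
            # One temporal index for the whole image; height and width vary per patch.
            for h in range(grid_h):
                for w in range(grid_w):
                    positions.append((next_pos, next_pos + h, next_pos + w))
            # Resume after the largest index used by this image.
            next_pos += max(grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return positions


# Two text tokens, a 2x3 patch grid, then one more text token.
for triple in mrope_position_ids([("text", 2), ("image", (2, 3)), ("text", 1)]):
    print(triple)
```

A video would extend the same idea by letting the temporal component advance from frame to frame, which is how a single scheme covers 1D, 2D, and 3D inputs.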
We have three models with 2, 8 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).
## Evaluation
### Image Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Claude-3.5 Sonnet | GPT-4o | **Qwen2-VL-72B** |
| :--- | :---: | :---: | :---: | :---: |
| MMMUval | 58.3 | 68.3 | **69.1** | 64.5
| DocVQAtest | 94.1 | 95.2 | 92.8 | **96.5**
| InfoVQAtest | 82.0 | - | - | **84.5**
| ChartQAtest | 88.4 | **90.8** | 85.7 | 88.3
| TextVQAval | 84.4 | - | - | **85.5**
| OCRBench | 852 | 788 | 736 | **877**
| MTVQA | 17.3 | 25.7 | 27.8 | **30.9**
| VCRen easy | 84.67 | 63.85 | 91.55 | **91.93**
| VCRzh easy | 22.09 | 1.0 | 14.87 | **65.37**
| RealWorldQA | 72.2 | 60.1 | 75.4 | **77.8**
| MMEsum | 2414.7 | 1920.0 | 2328.7 | **2482.7**
| MMBench-ENtest | **86.5** | 79.7 | 83.4 | **86.5**
| MMBench-CNtest | 86.3 | 80.7 | 82.1 | **86.6**
| MMBench-V1.1test | 85.5 | 78.5 | 82.2 | **85.9**
| MMT-Benchtest | 63.4 | - | 65.5 | **71.7**
| MMStar | 67.1 | 62.2 | 63.9 | **68.3**
| MMVetGPT-4-Turbo | 65.7 | 66.0 | 69.1 | **74.0**
| HallBenchavg | 55.2 | 49.9 | 55.0 | **58.1**
| MathVistatestmini | 67.5 | 67.7 | 63.8 | **70.5**
| MathVision | 16.97 | - | **30.4** | 25.9
### Video Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Gemini 1.5-Pro | GPT-4o | **Qwen2-VL-72B** |
| :--- | :---: | :---: | :---: | :---: |
| MVBench | 69.6 | - | - | **73.6**
| PerceptionTesttest | 66.9 | - | - | **68.0**
| EgoSchematest | 62.0 | 63.2 | 72.2 | **77.9**
| Video-MME (wo/w subs) | 66.3/69.6 | **75.0**/**81.3** | 71.9/77.2 | 71.2/77.8
### Agent Benchmarks
| |Benchmark | Metric | Previous SoTA | GPT-4o | **Qwen2-VL-72B** |
| :-- | :-- | :--: | :--: | :--: | :--: |
| General | FnCall[1] | TM | - | 90.2 | **93.1** |
| | | EM | - | 50.0 | **53.2** |
| Game | Number Line | SR | 89.4[2] | 91.5 | **100.0** |
| | BlackJack | SR | 40.2[2] | 34.5 | **42.6** |
| | EZPoint | SR | 50.0[2] | 85.5 | **100.0** |
| | Point24 | SR | 2.6[2] | 3.0 | **4.5** |
| Android | AITZ | TM | 83.0[3] | 70.0 | **89.6** |
| | | EM | 47.7[3] | 35.3 | **72.1** |
| AI2THOR | ALFREDvalid-unseen | SR | 67.7[4] | - | **67.8** |
| | | GC | 75.3[4] | - | **75.8** |
| VLN | R2Rvalid-unseen | SR | **79.0** | 43.7[5] | 51.7 |
| | REVERIEvalid-unseen | SR | **61.0** | 31.6[5] | 31.0 |
SR, GC, TM and EM are short for success rate, goal-condition success, type match and exact match. ALFRED is supported by SAM[6].
1. Self-Curated Function Call Benchmark by Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything
### Multilingual Benchmarks
## Requirements
The Qwen2-VL code has been merged into the latest Hugging Face transformers. We advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter the following error:
```
KeyError: 'qwen2_vl'
```
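
Before loading the model, it can help to check whether the installed `transformers` already ships Qwen2-VL support. The sketch below is one way to do that with the standard library; the helper `module_available` is hypothetical, and the module path `transformers.models.qwen2_vl` is an assumption about how the package is laid out.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if module `name` can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


# Assumed module path; adjust it if your transformers version is organized differently.
if not module_available("transformers.models.qwen2_vl"):
    print("Qwen2-VL support not found; consider installing transformers from source.")
```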
## Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
```bash
pip install qwen-vl-utils
```
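
As a sketch of the base64 input path, the helper below wraps raw image bytes into a chat message. The function name `image_bytes_to_message` is hypothetical, and the `data:image;base64,` prefix is an assumption about the form the toolkit accepts, so verify it against the `qwen-vl-utils` documentation.

```python
import base64


def image_bytes_to_message(image_bytes: bytes, prompt: str) -> list:
    """Build a single-turn user message carrying an image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Assumed data-URI form; check the qwen-vl-utils docs before relying on it.
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

In practice the bytes would come from a file or an upload, for example `image_bytes_to_message(open("photo.jpg", "rb").read(), "Describe this image.")`, where `photo.jpg` is a placeholder path.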
The following code snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:
```python
import torch  # only needed for the flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
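
The `min_pixels` and `max_pixels` comments in the snippet above express a visual-token range as a pixel budget by multiplying the token count by `28*28`. A minimal sketch of that arithmetic follows; the helper name `pixel_budget` is hypothetical, and the per-token factor is taken directly from those comments.

```python
PIXELS_PER_TOKEN = 28 * 28  # factor used in the min_pixels / max_pixels comments


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple:
    """Convert a visual-token range into a (min_pixels, max_pixels) pair."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The resulting values can be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented-out line of the snippet.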
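Results for the Multilingual Benchmarks section:

| Models | AR | DE | FR | IT | JA | KO | RU | TH | VI | AVG |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2-VL-72B | 20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 15.6 | 17.7 | 41.6 | 30.9 |
| GPT-4o | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
| Claude3 Opus | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
| Gemini Ultra | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
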
### Without qwen_vl_utils
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```
### Multi image inference
```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
### Video inference
```python
# Option 1: messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Option 2: messages containing a video file and a text query
# (this assignment replaces option 1 if both are run; keep only the one you need)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
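
Both message shapes in the video example share the same structure, so they can be produced by one small builder. The sketch below is a convenience wrapper only; `build_video_message` is a hypothetical name, and the keys it emits (`type`, `video`, `fps`, `max_pixels`) are the ones used in the example above.

```python
def build_video_message(video, prompt, fps=1.0, max_pixels=None):
    """Build a single-turn user message for a video.

    `video` is either a path/URL string or a list of frame paths,
    mirroring the two forms shown in the example above.
    """
    video_item = {"type": "video", "video": video, "fps": fps}
    if max_pixels is not None:
        # Only include the key when a per-frame pixel cap is requested.
        video_item["max_pixels"] = max_pixels
    return [
        {
            "role": "user",
            "content": [video_item, {"type": "text", "text": prompt}],
        }
    ]


# Placeholder paths, as in the example above.
frames = [f"file:///path/to/frame{i}.jpg" for i in range(1, 5)]
messages = build_video_message(frames, "Describe this video.")
```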
### Batch inference
```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Combine both conversations for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
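
Every example above slices the generated sequences before decoding, because each row returned by `generate` begins with the (padded) prompt tokens. The sketch below reproduces that slicing on plain Python lists; the token ids and the pad id `0` are made up for illustration, and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated row, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts padded to the same length (0 stands in for a pad token).
input_ids = [[0, 0, 11, 12], [21, 22, 23, 24]]
# Each generated row repeats its prompt and then appends the new tokens.
generated_ids = [[0, 0, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]

print(trim_prompt_tokens(input_ids, generated_ids))  # [[101, 102], [201, 202]]
```

Because padding makes every prompt row the same length, slicing by `len(in_ids)` removes exactly the prompt from each row of the batch.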