---
license: apache-2.0
language:
- zh
- en
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---


[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)  


## Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA



| Benchmark      | Metric            | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4V | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA        | BLEU1             | <span style="color:red">73.16</span>       | 38.12                   |            -              | 54.9      |
|                | BLEU2             | <span style="color:red">66.39</span>       | 33.56                   |            -              | 44.2      |
|                | BLEU3             | <span style="color:red">60.61</span>       | 31.76                   |            -              | 39.5      |
|                | BLEU4             | <span style="color:red">56.56</span>       | 30.97                   |            -              | 36.3      |
| OpenEQA        | Object State Recognition | <span style="color:red">71.83</span>   |          -               | 63.2   |            -              |
|                | Object Recognition        | <span style="color:red">49.46</span>  |          -               | 43.4   |            -              |
|                | Functional Reasoning      | 54.38                                 |          -               | <span style="color:red">57.4</span> |            -              |
|                | Spatial Understanding     | <span style="color:red">48.64</span>  |          -               | 33.6   |            -              |
|                | Attribute Recognition     | <span style="color:red">67.08</span>  |          -               | 57.2   |            -              |
|                | World Knowledge           | <span style="color:red">53.87</span>  |          -               | 50.7   |            -              |
|                | Object Localization       | <span style="color:red">43.06</span>  |          -               | 42.0   |            -              |

*Red marks the best score in each row.*




## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4V, and GPT-4o

| Dataset     | Split   | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4V   | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D        | test    | 79.9             | 81.4               | 78.2     | 94.2   |
| ChartQA     | test    | 83.0             | 80.0               | 78.5     | 85.7   |
| DocVQA      | test    | 91.6             | 87.5               | 88.4     | 92.8   |
| InfoVQA     | val     | 73.9             | 70.7               | -        | -      |
| InfoVQA     | test    | 70.0             | 68.8               | -        | -      |
| MMMU        | val     | 47.3             | 48.8               | 56.8     | 69.1   |
| MMStar      | test    | 58.5             | 61.7               | 57.1     | 63.9   |
| OCRBench    | -       | 749.0            | 697.0              | 656.0    | 805.0  |
| RealWorldQA | test    | 68.9             | 66.3               | 61.4     | 58.6   |
| SeedBench   | image   | 74.9             | 75.4               | 49.9     | 76.2   |
| MMbench     | en-dev  | 81.1             | 83.2               | 81.3     | 83.4   |
| MMbench     | en-test | 80.1             | 80.8               | 75.0     | -      |
| MME         | test    | 578/1603         | 418/1580           | 517/1409 | -      |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```
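
After installing, a quick import check confirms the package is visible to Python. This is a minimal sanity check, assuming the editable install registers the repo's `llava` package (the package the scripts below import from):

```bash
# Optional sanity check: the editable install should make the repo's
# Python package importable (package name assumed to be `llava`).
python -c "import llava; print('llava imported OK')"
```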

### B. Inference

```bash
git clone https://github.com/deepglint/unicom
cd unicom
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
# >> User: <image>Please describe this image. (asked in Chinese)
# >> Assistant: This is a creative piece of cat-head artwork. Rendered in an abstract style with multicolor gradients, the cat's head becomes a vibrant,
# eye-catching visual. The golden eyes look especially lively, and the pink nose adds a touch of cuteness. The overall design blends modern art with a
# traditional cat motif, creating a look that is both distinctive and engaging. (answered in Chinese)
```

### C. Evaluation for Embodied Ability

#### Step 1

Download the raw data by following the instructions for [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (validation split only).
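
For RoboVQA, the validation shards can be pulled with `gsutil`. This is a sketch only: the object names below are placeholders, so list the bucket first and copy just the validation files:

```bash
# Requires the Google Cloud SDK. Inspect the bucket layout, then copy the
# validation shards (the object path below is a placeholder, not the real path).
gsutil ls gs://gdm-robovqa
gsutil -m cp -r "gs://gdm-robovqa/<val-shards>" /path/to/raw/robovqa/
```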

#### Step 2

Convert the raw data into the format required for model evaluation:
```bash
# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```

#### Step 3

Make sure your top-level directory structure looks like this:
```
|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx
```
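
Before running the evaluation, a quick listing can confirm that the converted benchmarks and image folders are where the scripts expect them (substitute your own roots for the placeholder paths):

```bash
# Sanity check: all three converted benchmark files and both image folders
# from the tree above should exist under your chosen roots.
ls /path/to/your/benchmarks/OpenEQA/openeqa_scannet.parquet \
   /path/to/your/benchmarks/OpenEQA/openeqa_hm3d.parquet \
   /path/to/your/benchmarks/RoboVQA/robovqa.parquet
ls -d /path/to/your/images/openeqa_val /path/to/your/images/robovqa_val
```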

#### Step 4

Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
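
The command above runs MME only. Other benchmarks from the general-ability table can be evaluated by changing `--tasks`; exact task names vary between lmms-eval versions, so it is worth listing what your installation registers first:

```bash
# Print the task names registered in your lmms-eval install (supported by
# recent versions of the CLI), then rerun the launch command above with a
# comma-separated selection, e.g. --tasks mme,ai2d,chartqa,ocrbench
python -m lmms_eval --tasks list
```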

We would like to express our gratitude to [Huajie Tan](https://huggingface.co./tanhuajie2001), [Yumeng Wang](https://huggingface.co./devymex), and [Yin Xie](https://huggingface.co./Yin-Xie) for their significant contributions to the experimental validation of MLLMs.