---
license: apache-2.0
language:
- zh
- en
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
## Embodied Ability Evaluation: Performance on RoboVQA and OpenEQA
| Benchmark | Metric | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | <span style="color:red">73.16</span> | 38.12 | - | 54.9 |
| | BLEU2 | <span style="color:red">66.39</span> | 33.56 | - | 44.2 |
| | BLEU3 | <span style="color:red">60.61</span> | 31.76 | - | 39.5 |
| | BLEU4 | <span style="color:red">56.56</span> | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | <span style="color:red">71.83</span> | - | 63.2 | - |
| | Object Recognition | <span style="color:red">49.46</span> | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | <span style="color:red">57.4</span> | - |
| | Spatial Understanding | <span style="color:red">48.64</span> | - | 33.6 | - |
| | Attribute Recognition | <span style="color:red">67.08</span> | - | 57.2 | - |
| | World Knowledge | <span style="color:red">53.87</span> | - | 50.7 | - |
| | Object Localization | <span style="color:red">43.06</span> | - | 42.0 | - |
## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4v, and GPT-4o
| Dataset | Split | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |
## Usage
### A. Installation
```bash
git clone https://github.com/deepglint/unicom
cd unicom
# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```
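To confirm the editable install succeeded, a quick import check can be run from the repository root. This is a minimal sketch only; it assumes the package installed by `pip install -e ".[train]"` is importable as `llava`, as in other LLaVA-based repositories.
```python
# Minimal install sanity check (assumes the editable package is importable as
# `llava`, following the usual LLaVA-based repository layout).
import importlib

for name in ("torch", "transformers", "llava"):
    module = importlib.import_module(name)
    print(f"{name}: {getattr(module, '__version__', 'installed')}")
```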
### B. Inference
```bash
git clone https://github.com/deepglint/unicom
cd unicom
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir DeepGlint-AI/MLCD-Embodied-7B
# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
# >> User (zh): <image>Please describe this image.
# >> Assistant (zh): This is a creative piece of cat-head artwork. It uses multicolor gradients and an abstract style to render the cat's head as a vibrant, colorful visual statement. The cat's eyes are rendered in gold, giving them a striking liveliness,
#    while the pink nose adds a touch of cuteness. The overall design blends modern art with the traditional cat-head motif, creating a look that is both distinctive and engaging.
```
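For scripted (non-interactive) use, the model can also be loaded directly in Python. The sketch below is an assumption-laden outline rather than an official API reference: it presumes the bundled `llava` package follows the standard LLaVA-NeXT interface (`load_pretrained_model`, `conv_templates`, `process_images`, `tokenizer_image_token`) and that the `llava_qwen` model name and the `qwen_1_5` conversation template used in the evaluation command below apply here as well.
```python
# Hedged sketch of programmatic single-turn inference, assuming the unicom
# `llava` package mirrors the standard LLaVA-NeXT API; the import paths and the
# "llava_qwen" / "qwen_1_5" names are assumptions, not confirmed by this card.
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "DeepGlint-AI/MLCD-Embodied-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, "llava_qwen", device_map="cuda:0"
)
model.eval()

# Preprocess the image; process_images returns one tensor per image.
image = Image.open("./asserts/logo.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device="cuda:0") for t in image_tensor]

# Build a single-turn prompt with the qwen_1_5 conversation template.
conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat kind of animal is in this picture?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda:0")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```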
### C. Evaluation for Embodied Ability
#### Step 1
Download the raw data by following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (val split only).
#### Step 2
Convert the raw data into the format required for model evaluation. A small sanity check on the converted files is sketched after the code block below.
```bash
# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py
# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
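To sanity-check the converted files before evaluation, the parquet output can be inspected with pandas. This is only an illustrative check; the exact column names depend on how the conversion scripts above structure the benchmark and are not specified in this card.
```python
# Inspect a converted benchmark file; replace the path with your own.
import pandas as pd

bmk = pd.read_parquet("/path/to/your/benchmarks/RoboVQA/robovqa.parquet")
print(bmk.shape)                 # number of evaluation samples x number of fields
print(bmk.columns.tolist())      # field names produced by the conversion script
print(bmk.head(3))               # a few example rows
```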
#### Step 3
Make sure your top-level directory structure looks like this (a small layout check is sketched after the tree):
```
|--/path/to/your/benchmarks
| |--OpenEQA
| | |--openeqa_scannet.parquet
| | |--openeqa_hm3d.parquet
| |--RoboVQA
| |--robovqa.parquet
|--/path/to/your/images
|--openeqa_val
| |--scannet-v0
| | |--002-scannet-scene0709_00
| | |--xxx-scannet-scenexxxx_xx
| |--hm3d-v0
| |--000-hm3d-BFRyYbPCCPE
| |--xxx-hm3d-xxxxxxxxxxx
|--robovqa_val
|--robovqa_221911
|--robovqa_xxxxxx
```
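A short script can verify this layout before launching the evaluation. It is a convenience sketch only: it checks the files and folders named above, and the two root paths are placeholders to adjust to your own setup.
```python
# Verify the expected benchmark/image layout; adjust the two roots to your paths.
from pathlib import Path

bmk_root = Path("/path/to/your/benchmarks")
image_root = Path("/path/to/your/images")

expected = [
    bmk_root / "OpenEQA" / "openeqa_scannet.parquet",
    bmk_root / "OpenEQA" / "openeqa_hm3d.parquet",
    bmk_root / "RoboVQA" / "robovqa.parquet",
    image_root / "openeqa_val" / "scannet-v0",
    image_root / "openeqa_val" / "hm3d-v0",
    image_root / "robovqa_val",
]
for path in expected:
    print(("OK   " if path.exists() else "MISSING ") + str(path))
```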
#### Step 4
Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```
### D. Evaluation for General Ability
Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
--main_process_port=12444 \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
--tasks mme \
--batch_size 1 \
--log_samples \
--log_samples_suffix mlcd \
--output_path ./eval_log/
```
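After the run finishes, lmms-eval writes per-sample logs and aggregated scores under `--output_path`. The exact directory layout depends on the lmms-eval version, so the snippet below simply globs for JSON files under `./eval_log/` and prints any aggregated results it finds; treat it as a convenience sketch rather than part of the official workflow.
```python
# List JSON files written by lmms-eval under ./eval_log/ and print aggregated scores.
import json
from pathlib import Path

for result_file in sorted(Path("./eval_log").rglob("*.json")):
    print(result_file)
    try:
        data = json.loads(result_file.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue  # skip per-sample logs or non-JSON content
    if isinstance(data, dict) and "results" in data:
        print(json.dumps(data["results"], indent=2))
```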
We would like to express our gratitude to [Huajie Tan](https://huggingface.co./tanhuajie2001), [Yumeng Wang](https://huggingface.co./devymex), and [Yin Xie](https://huggingface.co./Yin-Xie) for their significant contributions to the experimental validation of MLLMs.