---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---




[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)  
## Model
We use the same Vision Transformer architecture as [CLIP's ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
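
This checkpoint is a standalone vision encoder. Below is a minimal feature-extraction sketch with 🤗 Transformers, assuming the weights are compatible with `CLIPVisionModel` and `CLIPImageProcessor`; the repo id and image path are placeholders.

```python
# Minimal feature-extraction sketch (assumption: the checkpoint loads with
# transformers' CLIPVisionModel / CLIPImageProcessor; repo id is a placeholder).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder repo id
processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

image = Image.open("example.jpg").convert("RGB")      # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

pooled = outputs.pooler_output       # (1, 1024): one global embedding per image
tokens = outputs.last_hidden_state   # (1, 577, 1024): CLS + 24x24 patch tokens at 336px
```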


## Data
Our model was trained on publicly available image-caption pairs from the [LAION-400M](https://arxiv.org/abs/2111.02114) and [COYO-700M](https://github.com/kakaobrain/coyo-dataset) datasets.

## Performance and Limitations

### A. MLLMs Evaluation Results
In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to evaluate its performance within Multimodal Large Language Models (MLLMs). [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) was used as the language model. The results show that the MLCD-based model performs strongly across multiple benchmarks, validating the effectiveness of MLCD within MLLMs (a sketch of how such a vision tower is typically consumed by the LLM follows the table).

| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:----------------------|:----------------------|
| LLM             | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D            | <span style="color:red">76.98</span>           | 73.15                 |
| ScienceQA_img   | <span style="color:red">78.09</span>           | 76.35                 |
| GQA             | <span style="color:red">64.17</span>           | 63.31                 |
| InfoVQA_val     | <span style="color:red">43.48</span>           | 38.88                 |
| MMBench_cn_dev  | <span style="color:red">74.83</span>           | 72.51                 |
| MMBench_en_dev  | <span style="color:red">76.37</span>           | 74.57                 |
| MME(cognition)  | <span style="color:red">432</span>             | 384                   |
| MME(perception) | <span style="color:red">1598</span>            | 1512                  |
| SeedBench       | <span style="color:red">68.20</span>           | 66.80                 |
| SeedBench_img   | <span style="color:red">73.75</span>           | 72.72                 |
| MMStar          | <span style="color:red">50.98</span>           | 48.98                 |
| MMMU            | <span style="color:red">44.30</span>           | 44.20                 |
| OCRBench        | <span style="color:red">531.00</span>          | 525.00                |
| ChartQA         | <span style="color:red">67.84</span>           | 66.52                 |
| DocVQA_val      | <span style="color:red">76.46</span>           | 75.21                 |
| POPE            | 88.69                 | <span style="color:red">88.83</span>  |
| TextVQA_val     | 61.69                 | <span style="color:red">62.47</span>  |
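
As a rough illustration of how a vision tower like this is wired into a LLaVA-style MLLM, the sketch below extracts per-patch features the way LLaVA-family models commonly do (an intermediate hidden layer, with the CLS token dropped). The layer index and repo id are illustrative assumptions, not the exact LLaVA-NeXT training configuration used in these experiments.

```python
# Sketch: per-patch features for an MLLM projector (layer choice and repo id are
# illustrative assumptions, not the exact LLaVA-NeXT training configuration).
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder repo id
processor = CLIPImageProcessor.from_pretrained(repo_id)
vision_tower = CLIPVisionModel.from_pretrained(repo_id).eval()

@torch.no_grad()
def encode_for_llm(pixel_values: torch.Tensor) -> torch.Tensor:
    """Return (batch, 576, 1024) patch features for the LLM's multimodal projector."""
    out = vision_tower(pixel_values, output_hidden_states=True)
    feats = out.hidden_states[-2]   # penultimate layer, a common LLaVA-style choice
    return feats[:, 1:]             # drop CLS, keep the 24x24 grid of patch tokens
```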




### B. Linear Probe Evaluation Results
This table presents linear-probe results comparing the MLCD and CLIP models (both ViT_L_14_336px) across a range of datasets. A linear probe freezes the pre-trained model's weights and trains a linear classifier on top of its features, measuring how well the learned representations transfer to different tasks. A minimal sketch of this protocol follows the table.

| Dataset        | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---------------|:----------------------|:----------------------|
| AVG            | <span style="color:red">87.15</span>             | 85.35                 |
| Food101        | <span style="color:red">96.21</span>             | 95.90                 |
| CIFAR-10       | <span style="color:red">99.36</span>             | 97.90                 |
| CIFAR-100      | <span style="color:red">93.69</span>             | 87.40                 |
| Birdsnap       | <span style="color:red">88.18</span>             | 79.90                 |
| SUN397         | <span style="color:red">87.96</span>             | 82.20                 |
| Stanford Cars  | <span style="color:red">95.16</span>             | 91.50                 |
| FGVC Aircraft  | <span style="color:red">86.38</span>             | 71.60                 |
| Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00                 |
| Oxford-IIIT Pets | <span style="color:red">96.27</span>          | 95.10                 |
| Caltech-101    | <span style="color:red">97.92</span>             | 96.00                 |
| Flowers102     | <span style="color:red">99.58</span>             | 99.20                 |
| MNIST          | 98.67                 | <span style="color:red">99.20</span>             |
| STL-10         | 99.28                 | <span style="color:red">99.70</span>             |
| EuroSAT        | <span style="color:red">99.06</span>             | 98.10                 |
| RESISC45       | <span style="color:red">95.48</span>             | 94.90                 |
| GTSRB          | 92.32                 | <span style="color:red">92.40</span>             |
| KITTI          | <span style="color:red">75.39</span>             | 69.20                 |
| Country211     | 38.12                 | <span style="color:red">46.40</span>             |
| PatchCamelyon  | <span style="color:red">88.00</span>             | 85.60                 |
| UCF101         | <span style="color:red">92.86</span>             | 92.00                 |
| Kinetics-700   | <span style="color:red">73.35</span>             | 73.00                 |
| CLEVR          | <span style="color:red">64.40</span>             | 60.30                 |
| Hateful Memes  | 72.00                 | <span style="color:red">77.30</span>             |
| SST-2          | 76.33                 | <span style="color:red">80.50</span>             |
| ImageNet       | <span style="color:red">86.30</span>             | 85.40                 |
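
For concreteness, here is a minimal sketch of the linear-probe protocol described above: freeze the encoder, extract one pooled embedding per image, and fit a logistic-regression classifier on those frozen features. The repo id, the dataset placeholders (`train_images`, `train_labels`, etc.), and the regularization strength are illustrative assumptions rather than the exact evaluation setup.

```python
# Linear-probe sketch on frozen features (repo id, data placeholders, and the
# regularization strength are assumptions, not the exact evaluation recipe).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder repo id
processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

@torch.no_grad()
def embed(images):
    """Frozen-backbone features: one pooled vector per PIL image."""
    inputs = processor(images=images, return_tensors="pt")
    return model(**inputs).pooler_output.cpu().numpy()

# train_images / test_images: lists of PIL images; labels: integer class ids (placeholders)
X_train, X_test = embed(train_images), embed(test_images)

clf = LogisticRegression(max_iter=1000, C=0.316)  # placeholder regularization strength
clf.fit(X_train, train_labels)
print("linear-probe accuracy:", clf.score(X_test, test_labels))
```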


### C. Limitations

Models trained at higher input resolution handle OCR-related tasks better. We are currently training such models and will release them soon.


## Acknowledgments

We would like to express our gratitude to [Xie Yin](https://huggingface.co./Yin-Xie) and [Yumeng Wang](https://huggingface.co./devymex) for their significant contributions to the experimental validation in MLLMs.