Create README.md
README.md
ADDED
@@ -0,0 +1,49 @@
---
base_model:
- Qwen/Qwen2-7B
- google/siglip-so400m-patch14-384
license: apache-2.0
---

<style>
.inline-img {
  display: inline-block;
  /* inline-block keeps the image inline while still allowing width and height to be set */
}
</style>

<h2>
<a href="https://github.com/hanhuang22/AITQE">
<img class="inline-img" src="https://cdn-uploads.huggingface.co/production/uploads/65d86142a3c18e931641be25/ZT5e7XI0tWBfny-YKfnSV.png" alt="Logo" width="40">
Beyond Filtering:<br>Adaptive Image-Text Quality Enhancement for MLLM Pretraining
</a>
</h2>

arxiv:

github: https://github.com/hanhuang22/AITQE

[2024.10.12] Released the inference code and pre-trained model of AITQE.

We propose the **A**daptive **I**mage-**T**ext **Q**uality **E**nhancer, **AITQE**, a model that dynamically assesses and enhances the quality of image-text pairs. In the figure below, the conventional method (a) discards low-quality samples from the raw data, reducing the amount of pretraining data, while AITQE (b) enhances low-quality samples, retaining the same volume of data for MLLM pretraining.

<img src="https://cdn-uploads.huggingface.co/production/uploads/65d86142a3c18e931641be25/CvTD-H7fZSx8F1BZ3a-WY.png" alt="illus" width="800">

Specifically, for low-quality pairs, such as those with low semantic similarity between modalities or subpar linguistic quality, AITQE rewrites the text, generating a high-quality caption from the input image and the raw low-quality text.
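
The difference between filtering and enhancement can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`score`, `caption`, `recaption`) and the threshold value are hypothetical, chosen only to show how filtering shrinks a dataset while enhancement keeps its size.

```python
from typing import Iterable

# Hypothetical cut-off: the README does not state when a pair counts as low quality.
LOW_QUALITY_THRESHOLD = 3


def filter_pairs(pairs: Iterable[dict]) -> list[dict]:
    """Conventional approach (a): drop pairs that score below the threshold."""
    return [p for p in pairs if p["score"] >= LOW_QUALITY_THRESHOLD]


def enhance_pairs(pairs: Iterable[dict]) -> list[dict]:
    """Enhancement approach (b): keep every pair, swapping in the rewritten
    caption when the original caption scores below the threshold."""
    enhanced = []
    for p in pairs:
        if p["score"] < LOW_QUALITY_THRESHOLD:
            enhanced.append({**p, "caption": p["recaption"]})
        else:
            enhanced.append(p)
    return enhanced


# Toy records with hypothetical field names and made-up values.
pairs = [
    {"image": "a.png", "caption": "this is a test", "score": 2,
     "recaption": "A man stands in front of a checklist."},
    {"image": "b.png", "caption": "A red bicycle leaning on a wall.", "score": 5,
     "recaption": "A red bicycle leans against a brick wall."},
]

print(len(filter_pairs(pairs)))   # 1: the low-scoring pair is discarded
print(len(enhance_pairs(pairs)))  # 2: both pairs are kept
```

In this sketch the high-scoring pair keeps its original caption, and only the low-scoring pair receives the rewritten one.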

Run inference with the code from the [GitHub repository](https://github.com/hanhuang22/AITQE):
```bash
python inference.py \
    --model_path /path/to/AITQE \
    --output_all \
    --gpu_id 0 \
    --image_path ./figs/test.png \
    --caption "Some random text to the image like this is a test"
```

This produces output like the following:

<pre style="white-space: pre-wrap; word-wrap: break-word;">
{"Recaption": "A man stands in front of a checklist of customer service questions, including 'Do you take each customer seriously?' and 'Do you qualify customers properly?'", "Overall Score": "2<Overall>", "Overall Explanation": "The caption is vague and does not accurately describe the image or its content. It lacks detail and relevance to the checklist shown in the image.", "Text Quality Score": 3, "Text Quality Explanation": "The caption is grammatically correct but lacks clarity and relevance to the image. It is vague and does not provide a meaningful description.", "Image-Text Matching Score": 2, "Image-Text Matching Explanation": "The caption does not accurately describe the image, which features a checklist of customer service questions. The caption is unrelated to the content of the image.", "Object Detail Score": 2, "Object Detail Explanation": "The caption does not provide any details about the objects in the image, such as the checklist or the person in the background.", "Semantic Understanding Score": 2, "Semantic Understanding Explanation": "The caption fails to convey any understanding of the image's context or purpose, which is about customer service evaluation.", "Text/Chart Description Score": 2, "Text/Chart Description Explanation": "The caption does not describe the text in the image, which is a checklist of customer service questions."}
</pre>
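
The output above is a JSON object, so it can be post-processed with standard tooling. The sketch below assumes the output is available as a string. It uses a shortened version of the sample (explanations dropped, recaption truncated) and converts every `... Score` field to an integer, including the `"2<Overall>"` string form seen in the sample. The helper name `to_int_score` is hypothetical and not part of the AITQE code.

```python
import json
import re

# Shortened stand-in for the sample output above (not the full record).
raw = (
    '{"Recaption": "A man stands in front of a checklist of customer service questions.", '
    '"Overall Score": "2<Overall>", '
    '"Text Quality Score": 3, '
    '"Image-Text Matching Score": 2}'
)


def to_int_score(value):
    """Return the leading integer of a score value, e.g. "2<Overall>" -> 2."""
    if isinstance(value, int):
        return value
    match = re.match(r"\s*(\d+)", str(value))
    if match is None:
        raise ValueError(f"no numeric score in {value!r}")
    return int(match.group(1))


result = json.loads(raw)

# Collect every field whose key ends in "Score" as a plain integer.
scores = {key: to_int_score(val) for key, val in result.items() if key.endswith("Score")}

print(scores["Overall Score"])  # 2
print(result["Recaption"])
```

Normalizing the scores this way makes it straightforward to sort or threshold records without special-casing the string-valued overall score.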