victormiller committed
Commit bb3ebe4 • 1 Parent(s): 670d19f

Update README.md

Files changed (1)
  1. README.md +15 -30

README.md CHANGED
@@ -22,20 +22,20 @@ and Evaluation Framework for Multimodal LLMs](https://arxiv.org/pdf/2406.20098).
  ## Web2Code Dataset
  Our Web2Code instruction tuning dataset construction and instruction generation process
  involves four key components:
- 1. Creation of new webpage image-code pair data: We generated
- high-quality HTML webpage-code pairs following the CodeAlpaca prompt [6] using GPT-3.5 and
- convert them into instruction-following data. (2) Refinement of existing webpage code generation
- data: We transform existing datasets including WebSight [22] and Pix2Code [4] into an instruction-
- following data format similar to LLaVA data [33], so they can be used as instruction-following data
- to train MLLMs. (3) Creation of a new text question-answer pair data: We generated a new question-
- answer pair dataset utilizing our new GPT-3.5 generated data from (1) for webpage understanding.
- (4) Refinement of existing webpage understanding data: We refine the WebSRC [10] question-answer
- data to improve its quality using the GPT-4.

  ## Evaluations

- ## Webpage Code Generation Benchmark

  | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | VSA ↑ | CAD ↑ | TCC ↑ | UII ↑ | Overall ↑ |
  |------------------------|------|-----|------------------|------------------|--------|--------|--------|--------|------------|
  | **CrystalChat-7B** | | | | | 4.714 | 4.572 | 4.865 | 5.147 | 4.825 |
@@ -46,11 +46,11 @@ data to improve its quality using the GPT-4.
  | | ✓ | | | | 6.871 | 6.660 | 6.589 | 6.897 | 6.754 |
  | | | ✓ | | | 3.898 | 3.489 | 3.340 | 3.651 | 3.595 |
  | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
-
  **Table 1:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.

- ## Webpage Understanding Accuracy
-
  | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
  |------------------------|------|-----|------------------|------------------|--------------|
  | **CrystalChat-7B** | | | | | 73.94 |
@@ -60,6 +60,7 @@ data to improve its quality using the GPT-4.
  | | ✓ | | | | 68.11 |
  | | | ✓ | | | 70.82 |
  | | ✓ | ✓ | ✓ | ✓ | **71.23** |

  **Table 2:** The accuracy of webpage understanding under various data configurations and LLM backbones.
  All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configurations by default.
@@ -75,7 +76,7 @@ data to improve its quality using the GPT-4.

- ## Evaluation

  General Evaluation Metrics for MLLMs. MME serves as an extensive evaluation benchmark
  aiming to assess the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. Additionally, we evaluate the performance of our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models’ ability to resist hallucination using POPE.
@@ -88,14 +89,11 @@ aiming to assess perceptual and cognitive capability of MLLMs within 14 sub-task
  **Table 3:** Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the general domain data (i.e. LLaVA)*

-
-
  ## Data and Training Details

  ### Pretrain Data
  LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.

- TO-DO: Add new short caption data
  ### Finetune Data
  The finetuning data contains the following:
@@ -120,19 +118,6 @@ The dataset chosen was created by LLaVA with academic-task-oriented VQA data mix

  **Table 4:** Instruction-following Data Mixture of LLaVA-1.5.*

- #### Web2Code Data
-
- The Web2Code instruction tuning dataset was released in [Web2Code: A Large-scale Webpage-to-Code Dataset
- and Evaluation Framework for Multimodal LLMs](TODO: Add link). The dataset construction and instruction generation process involves four key components:
-
- DWCG: We created new webpage image-code pair data DWCG by generating high-quality HTML webpage-code pairs following the CodeAlpaca prompt using GPT-3.5 and converting them into instruction-following data.
-
- DWCG<sub>R</sub>: We refined existing webpage code generation data by transforming existing datasets, including WebSight and Pix2Code, into an instruction-following data format similar to LLaVA data, so they can be used as instruction-following data to train MLLMs.
-
- DWU: We created new text question-answer pair data by generating a new question-answer pair dataset utilizing our new GPT-3.5 generated data for webpage understanding.
-
- DWU<sub>R</sub>: We refined the WebSRC question-answer data to improve its quality using GPT-4.
-
  ### Code Datasets

  | Dataset | DWCG (ours) | DWCG<sub>R</sub> (ours) |
 
  ## Web2Code Dataset
  Our Web2Code instruction tuning dataset construction and instruction generation process
  involves four key components:
+ 1. Creation of new webpage image-code pair data **(DWCG)**: We generated high-quality HTML webpage-code pairs following the CodeAlpaca prompt using GPT-3.5 and converted them into instruction-following data (a record sketch follows below).
+ 2. Refinement of existing webpage code generation data **(DWCG<sub>R</sub>)**: We transformed existing datasets, including WebSight and Pix2Code, into an instruction-following data format similar to the LLaVA data, so they can be used as instruction-following data to train MLLMs.
+ 3. Creation of new text question-answer pair data **(DWU)**: We generated a new question-answer pair dataset utilizing our new GPT-3.5-generated data from (1) for webpage understanding.
+ 4. Refinement of existing webpage understanding data **(DWU<sub>R</sub>)**: We refined the WebSRC question-answer data to improve its quality using GPT-4.
+
+ The Web2Code instruction tuning dataset was released in [Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs](https://huggingface.co/datasets/MBZUAI/Web2Code).
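To make the instruction-following format concrete, below is a minimal sketch of how one DWCG webpage screenshot and its HTML source might be wrapped into a LLaVA-style training record. The field names (`id`, `image`, `conversations`) and the prompt wording are assumptions borrowed from the LLaVA convention, not necessarily the exact schema of the released dataset.

```python
import json
from pathlib import Path

def make_dwcg_record(sample_id: str, image_path: str, html_code: str) -> dict:
    """Wrap one webpage screenshot + its HTML source as a LLaVA-style
    instruction-following record (hypothetical schema)."""
    instruction = (
        "<image>\nGenerate the HTML code that reproduces the webpage "
        "shown in the image."
    )
    return {
        "id": sample_id,
        "image": image_path,                      # path to the rendered screenshot
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": html_code},  # GPT-3.5-generated HTML as the target
        ],
    }

if __name__ == "__main__":
    record = make_dwcg_record(
        sample_id="dwcg_000001",
        image_path="images/dwcg_000001.png",
        html_code="<html><body><h1>Hello</h1></body></html>",
    )
    Path("dwcg_sample.json").write_text(json.dumps([record], indent=2))
```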
 
 
 

  ## Evaluations

+ ## Webpage Code Generation Benchmark (WCGB)
+
+ Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage’s fidelity at the image level. We convert the predicted HTML code back into images using Selenium WebDriver to allow a direct visual comparison with the ground-truth images. The evaluation, depicted on the left side of Figure 6, considers 10 different aspects, which are further categorized into four evaluation matrices using the GPT-4 Vision API.
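The rendering step above can be approximated with a short Selenium script. This is a minimal sketch assuming headless Chrome with a matching chromedriver on the path; the window size, driver options, and any post-render waits used in the actual WCGB pipeline are not specified here and are illustrative choices.

```python
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def render_html_to_png(html_path: str, png_path: str, width: int = 1280, height: int = 1024) -> None:
    """Load a predicted HTML file in headless Chrome and save a screenshot
    so it can be compared against the ground-truth webpage image."""
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"--window-size={width},{height}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(Path(html_path).resolve().as_uri())  # open the local HTML via a file:// URL
        driver.save_screenshot(png_path)
    finally:
        driver.quit()

# Example: render one predicted page before sending both images to the GPT-4 Vision judge.
# render_html_to_png("predictions/sample_0001.html", "renders/sample_0001.png")
```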
 
+ ### Results
  | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | VSA ↑ | CAD ↑ | TCC ↑ | UII ↑ | Overall ↑ |
  |------------------------|------|-----|------------------|------------------|--------|--------|--------|--------|------------|
  | **CrystalChat-7B** | | | | | 4.714 | 4.572 | 4.865 | 5.147 | 4.825 |
 
  | | ✓ | | | | 6.871 | 6.660 | 6.589 | 6.897 | 6.754 |
  | | | ✓ | | | 3.898 | 3.489 | 3.340 | 3.651 | 3.595 |
  | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
+ | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
  **Table 1:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.

+ ## Webpage Understanding Benchmark (WUB)
+ ### Results
  | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
  |------------------------|------|-----|------------------|------------------|--------------|
  | **CrystalChat-7B** | | | | | 73.94 |
 
  | | ✓ | | | | 68.11 |
  | | | ✓ | | | 70.82 |
  | | ✓ | ✓ | ✓ | ✓ | **71.23** |
+ | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |

  **Table 2:** The accuracy of webpage understanding under various data configurations and LLM backbones.
  All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configurations by default.
 
+ ## General Evaluations

  General Evaluation Metrics for MLLMs. MME serves as an extensive evaluation benchmark
  aiming to assess the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. Additionally, we evaluate the performance of our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models’ ability to resist hallucination using POPE.
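POPE probes hallucination with binary yes/no questions about whether an object appears in an image, so scoring reduces to comparing the model's yes/no answers against the labels. The helper below is a minimal sketch, assuming answers are already collected as plain "yes"/"no" strings; it is not the official POPE evaluation script.

```python
def pope_scores(predictions: list[str], labels: list[str]) -> dict:
    """Compute accuracy and F1 for yes/no answers, treating 'yes' as the positive class."""
    def to_bool(ans: str) -> bool:
        return ans.strip().lower().startswith("yes")

    preds = [to_bool(p) for p in predictions]
    golds = [to_bool(g) for g in labels]
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: pope_scores(["Yes", "no"], ["yes", "yes"]) -> accuracy 0.5
```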
 

  **Table 3:** Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the general domain data (i.e. LLaVA)*

  ## Data and Training Details

  ### Pretrain Data
  LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
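As an illustration of how an image-caption pair can be turned into such a synthesized conversation, here is a minimal sketch. The instruction pool and record fields mimic the LLaVA pretraining format; the exact templates used to build LCS-558K are assumptions here.

```python
import random

# A few caption-style instructions of the kind used to turn image-caption
# pairs into single-turn conversations (illustrative, not the exact LLaVA pool).
INSTRUCTIONS = [
    "Describe this image.",
    "Give a short description of the picture.",
    "What is shown in this image?",
]

def make_pretrain_record(sample_id: str, image_path: str, caption: str) -> dict:
    """Pair a randomly chosen instruction with the raw caption as the answer."""
    instruction = random.choice(INSTRUCTIONS)
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{instruction}"},
            {"from": "gpt", "value": caption},  # raw caption serves as the default answer
        ],
    }

# Example: make_pretrain_record("lcs_000042", "images/000042.jpg", "A dog running on a beach.")
```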
 
 
  ### Finetune Data
  The finetuning data contains the following:
 
 
  **Table 4:** Instruction-following Data Mixture of LLaVA-1.5.*

  ### Code Datasets

  | Dataset | DWCG (ours) | DWCG<sub>R</sub> (ours) |