---
license: mit
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- nlp
- llm
- mllm
---

# CrystalChat-7B-Web2Code: a fully-reproducible vision large language model based on CrystalChat-7B LLM for webpage code generation

## Model Description

The CrystalChat-7B-based multi-modal large language model (MLLM) mimics the training recipe used for the Vicuna-7B-based [LLaVA-v1.5](https://huggingface.co./docs/transformers/main/model_doc/llava). CrystalChat-7B-based MLLMs are entirely transparent: all materials, including code, data, model checkpoints, and intermediate results, are open-sourced at [TODO: Add paper link](). CrystalChat-7B-Web2Code is an MLLM specialized in webpage image-to-HTML code generation.


## Evaluations

### Webpage Code Generation Benchmark

| LLM Backbone           | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | VSA ↑  | CAD ↑  | TCC ↑  | UII ↑  | Overall ↑ |
|------------------------|------|-----|------------------|------------------|--------|--------|--------|--------|------------|
| **CrystalChat-7B**     |      |     |                  |                  | 4.714  | 4.572  | 4.865  | 5.147  | 4.825      |
|                        | βœ“    |     |                  |                  | 7.900  | 8.001  | 8.204  | 8.215  | 8.080      |
|                        | βœ“    | βœ“   |                  |                  | 7.900  | 8.001  | 8.204  | 8.215  | 8.080      |
|                        | βœ“    | βœ“   | βœ“                | βœ“                | **8.384**  | **8.287**  | **8.417**  | **8.488**  | **8.394**      |
| **Vicuna-7B**          |      |     |                  |                  | 3.042  | 3.250  | 3.333  | 3.167  | 3.198      |
|                        | βœ“    |     |                  |                  | 6.871  | 6.660  | 6.589  | 6.897  | 6.754      |
|                        |      | βœ“   |                  |                  | 3.898  | 3.489  | 3.340  | 3.651  | 3.595      |
|                        | βœ“    | βœ“   | βœ“                | βœ“                | **7.876**  | **7.687**  | **7.267**  | **7.563**  | **7.598**      |

**Table 1:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.

### Webpage Understanding Accuracy

| LLM Backbone           | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
|------------------------|------|-----|------------------|------------------|--------------|
| **CrystalChat-7B**     |      |     |                  |                  | 73.94        |
|                        | βœ“    | βœ“   |                  |                  | 73.48        |
|                        | βœ“    | βœ“   | βœ“                | βœ“                | **74.14**    |
| **Vicuna-7B**          |      |     |                  |                  | 71.12        |
|                        | βœ“    |     |                  |                  | 68.11        |
|                        |      | βœ“   |                  |                  | 70.82        |
|                        | βœ“    | βœ“   | βœ“                | βœ“                | **71.23**    |

**Table 2:** The accuracy of webpage understanding under various data configurations and LLM backbones. All models are instruction-tuned and evaluated on our Webpage Understanding Benchmark (WUB). Note that the general-domain data (i.e., LLaVA) is included in all data configurations by default.


### About CrystalChat-7B-Web2Code:
* 7 billion parameter LLM
* CLIP ViT-L/14-336px vision encoder
* Languages: English
* Models Released: CrystalChat-7B-Web2Code
* Trained in 2 stages
* License: MIT



## General MLLM Evaluation

We report general evaluation metrics for MLLMs. MME serves as an extensive benchmark for assessing the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. We also evaluate our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models' robustness to hallucination using POPE.

| LLM Backbone                      | MME-P   | MME-C  | POPE  | SciQA  | TextVQA |
|-----------------------------------|---------|--------|-------|--------|---------|
| CrystalCoder-7B | 1359.83 | 238.92 | 86.182 | 64.15 | 50.39   |
| CrystalChat-7B  | 1456.53 | **308.21** | 86.96 | 67.77 | **57.84** |
| Vicuna-7B | **1481.12** | 302.85 | **87.174** | **67.97** | 56.49   |

**Table 3:** Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the general-domain data (i.e., LLaVA).



## Data and Training Details

### Pretrain Data
LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
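To make this format concrete, a hypothetical record in the LLaVA-style pretraining layout might look like the sketch below; the id, image path, and caption are illustrative and not taken from the released files.

```python
# A hypothetical record illustrating the LLaVA-style pretraining format described
# above: an image-caption pair wrapped into a single-turn conversation using a
# randomly selected instruction such as "Describe this image."
# The id, image path, and caption below are illustrative only.
pretrain_record = {
    "id": "004539375",                      # hypothetical sample id
    "image": "00453/004539375.jpg",         # hypothetical relative image path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image briefly."},
        {"from": "gpt", "value": "A wooden dock stretching into a calm lake at sunset."},
    ],
}
```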

TO-DO: Add new short caption data
### Finetune Data
The finetuning data contains the following:

#### LLaVA Finetuning Data
The dataset was created by LLaVA by mixing academic-task-oriented VQA data with conversations from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data, designed for visual instruction tuning with the aim of building large multimodal models with capabilities akin to GPT-4 in both vision and language.

<!-- The full data sequence can be found [here](https://huggingface.co./datasets/liuhaotian/LLaVA-Instruct-150K) -->

| Data          | Size | Response formatting prompts                                              |
|---------------|------|--------------------------------------------------------------------------|
| LLaVA [36]    | 158K | –                                                                        |
| ShareGPT [46] | 40K  | –                                                                        |
| VQAv2 [19]    | 83K  | Answer the question using a single word or phrase.                       |
| GQA [21]      | 72K  | Answer the question using a single word or phrase.                       |
| OKVQA [41]    | 9K   | Answer the question using a single word or phrase.                       |
| OCRVQA [42]   | 80K  | Answer the question using a single word or phrase.                       |
| A-OKVQA [45]  | 66K  | Answer with the option’s letter from the given choices directly.         |
| TextCaps [47] | 22K  | Provide a one-sentence caption for the provided image.                   |
| RefCOCO [24, 40] | 48K  | Note: randomly choose between the two formats. Provide a short description for this region. |
| VG [25]       | 86K  | Provide the bounding box coordinate of the region this sentence describes. |
| **Total**     | **665K** |                                                                      |

**Table 4:** Instruction-following data mixture of LLaVA-1.5.

#### Web2Code Data

The Web2Code instruction-tuning dataset was released in [Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs](TODO: Add link). The dataset construction and instruction generation process involves four key components (a minimal conversion sketch follows this list):

- **DWCG**: We created new webpage image-code pair data by generating high-quality HTML webpage-code pairs with GPT-3.5, following the CodeAlpaca prompt, and converting them into instruction-following data.
- **DWCG<sub>R</sub>**: We refined existing webpage code generation data by transforming existing datasets, including WebSight and Pix2Code, into a LLaVA-like instruction-following format so they can be used to train MLLMs.
- **DWU**: We created a new question-answer pair dataset for webpage understanding, generated with GPT-3.5 from our newly created webpage data.
- **DWU<sub>R</sub>**: We refined the WebSRC question-answer data with GPT-4 to improve its quality.
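The sketch below illustrates, under assumed field names and an illustrative instruction template, how a webpage screenshot and its HTML source could be wrapped into a LLaVA-style instruction-following record; it is not the exact template used to build DWCG.

```python
# A minimal sketch of turning a (screenshot, HTML) pair into a LLaVA-style
# instruction-following record. Field names, paths, and the instruction text
# are assumptions for illustration, not the exact DWCG templates.
def to_instruction_record(sample_id: str, image_path: str, html_code: str) -> dict:
    instruction = (
        "<image>\nGenerate the HTML code that reproduces the webpage shown in the image."
    )
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": html_code},
        ],
    }

record = to_instruction_record(
    "dwcg_000001",                    # hypothetical sample id
    "dwcg/images/000001.png",         # hypothetical screenshot path
    "<html><body><h1>Pricing</h1></body></html>",
)
```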

### Code Datasets

| Dataset | DWCG (ours) | DWCG<sub>R</sub> (ours) |
|---------|-------------|-------------------|
| **Instruction** | βœ“ | βœ“ |
| **Source** | Synthetic | Synthetic |
| **Size** | 60K | 824.7K |
| **Avg Length (tokens)** | 471.8Β±162.3 | 652.85Β±157.0 |
| **Avg Tag Count** | 28.1Β±10.6 | 35.3Β±9.0 |
| **Avg DOM Depth** | 5.3Β±1.0 | 6.5Β±1.0 |
| **Avg Unique Tags** | 13.6Β±2.7 | 13.5Β±2.5 |

**Table 5:** DWCG is a newly generated GPT-3.5-based dataset, while DWCG<sub>R</sub> is the refined dataset that utilizes the WebSight and Pix2Code datasets.
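For reference, statistics such as total tag count, unique tag count, and DOM depth can be computed from raw HTML with Python's standard library. The sketch below is a minimal illustration of how such statistics could be derived, not the script used to produce Table 5.

```python
# A minimal sketch (not the authors' script) of computing per-page HTML statistics
# such as total tag count, unique tag count, and maximum DOM depth.
from html.parser import HTMLParser

# Void elements have no closing tag and should not contribute to nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class HTMLStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_count = 0        # total number of opening tags
        self.unique_tags = set()  # distinct tag names seen
        self.depth = 0            # current nesting depth
        self.max_depth = 0        # deepest nesting observed (DOM depth)

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        self.unique_tags.add(tag)
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

page = "<html><body><div><p>Hello <b>world</b></p><img src='x.png'></div></body></html>"
stats = HTMLStats()
stats.feed(page)
print(stats.tag_count, len(stats.unique_tags), stats.max_depth)  # 6 6 5
```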


### Webpage Understanding Datasets

| Dataset       | DWU     | DWU<sub>R</sub> |
|---------------|---------|-----------------|
| **Instruction** | βœ“       | βœ“               |
| **Size**         | 243.5K | 51.5K           |

**Table 6:** Distribution of DWU and DWU<sub>R</sub> datasets. Both datasets include high-quality question-answer pairs for webpage understanding.



## Examples


Example 1:

<center><img src="ori.png" alt="Original input image"/></center>
Image 1: Original input image.

<center><img src="crystalchat.png" alt="CrystalChat-7B-Web2Code model generated output"/></center>
Image 2: CrystalChat-7B-Web2Code model output.

Example 2:

<center><img src="hand_draw1_.png" alt="CrystalChat-7B-Web2Code model generated output for a hand-drawn webpage"/></center>
Image 3: CrystalChat-7B-Web2Code output generated from a hand-drawn webpage input.

## Loading Crystal
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer (remote code is required for this model).
tokenizer = AutoTokenizer.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    padding_side="right",
    trust_remote_code=True,
)

# Load the model in half precision and let Accelerate place it on available devices.
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
```
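As a follow-up, the snippet below shows a minimal text-only generation call with the `model` and `tokenizer` loaded above. It is a hedged sketch: the multimodal path (passing a webpage screenshot alongside the prompt) is handled by the model's remote code, and its exact image-input interface is not shown here.

```python
# Minimal text-only generation sketch using the `model` and `tokenizer` loaded above.
# The prompt is illustrative; the image-input interface is provided by the model's
# remote code and may differ.
prompt = "Write a minimal HTML page with a centered heading that says 'Hello'."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding for reproducibility
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```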



## LLM360
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.

Crystal-based models enable community-owned AGI by creating standards and tools to advance the bleeding edge of LLM capability and empower knowledge transfer, research, and development.

We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high-quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.

[Visit us](https://www.llm360.ai/)

## Citation

**BibTeX:**

```bibtex
@article{
      title={}, 
      author={},
      year={},
}
```