---
license: mit
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- nlp
- llm
- mllm
datasets:
- MBZUAI/Web2Code
---
# CrystalChat-7B-Web2Code: a fully reproducible vision large language model based on the CrystalChat-7B LLM for webpage code generation
## Model Description
The CrystalChat-7B based multi-modal large language model (MLLM) follows the training recipe used for the Vicuna-7B based [LLaVA-v1.5](https://huggingface.co./docs/transformers/main/model_doc/llava). CrystalChat-7B based MLLMs are entirely transparent: all materials, including code, data, model checkpoints, and intermediate results, are open-sourced as part of [Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs](https://arxiv.org/pdf/2406.20098). The CrystalChat-7B-Web2Code MLLM is specialized in generating HTML code from webpage images.
## Web2Code Dataset
Our Web2Code instruction tuning dataset construction and instruction generation process
involves four key components:
1. Creation of new webpage image-code pair data **(DWCG)**: We generated high-quality HTML webpage-code pairs following the CodeAlpaca prompt using GPT-3.5 and converted them into instruction-following data (see the sketch below).
2. Refinement of existing webpage code generation data **(DWCG<sub>R</sub>)**: We transformed existing datasets, including WebSight and Pix2Code, into an instruction-following format similar to the LLaVA data so they can be used as instruction-following data to train MLLMs.
3. Creation of new text question-answer pair data **(DWU)**: We generated a new question-answer pair dataset for webpage understanding, utilizing our new GPT-3.5-generated data.
4. Refinement of existing webpage understanding data **(DWU<sub>R</sub>)**: We refined the WebSRC question-answer data using GPT-4 to improve its quality.
The Web2Code instruction tuning dataset was released in [Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs](https://huggingface.co./datasets/MBZUAI/Web2Code).
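To make the conversion in step 1 concrete, the sketch below wraps a single webpage screenshot and its HTML source into a LLaVA-style instruction-following record. This is a minimal illustration assuming the common LLaVA conversation layout (`id`, `image`, `conversations`); the file names and the instruction string are placeholders, not the exact prompts used to build DWCG.
```python
import json

def make_web2code_record(sample_id: str, image_path: str, html_code: str) -> dict:
    """Wrap one (screenshot, HTML) pair as a LLaVA-style instruction-following record."""
    # Placeholder instruction; the actual DWCG prompts are generated with GPT-3.5.
    instruction = "Generate the HTML code that reproduces the webpage shown in the image."
    return {
        "id": sample_id,
        "image": image_path,  # path to the rendered webpage screenshot
        "conversations": [
            {"from": "human", "value": f"<image>\n{instruction}"},
            {"from": "gpt", "value": html_code},  # target HTML source
        ],
    }

# One serialized record, in the form LLaVA-style training data is typically stored.
record = make_web2code_record("dwcg_000001", "images/dwcg_000001.png", "<html>...</html>")
print(json.dumps(record, indent=2))
```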
## Evaluations
## Webpage Understanding Benchmark (WUB)
### Results
| LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
|------------------------|------|-----|------------------|------------------|--------------|
| **CrystalChat-7B** | | | | | 73.94 |
| | ✓ | ✓ | | | 73.48 |
| | ✓ | ✓ | ✓ | ✓ | **74.14** |
| **Vicuna-7B** | | | | | 71.12 |
| | ✓ | | | | 68.11 |
| | | ✓ | | | 70.82 |
| | ✓ | ✓ | ✓ | ✓ | **71.23** |
| **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |
**Table 1:** The accuracy of webpage understanding under various data configurations and LLM backbones.
All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configurations by default.
## Webpage Code Generation Benchmark (WCGB)
Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage's fidelity at the image level. We convert the predicted HTML code back into images using Selenium WebDriver to allow a direct visual comparison with the ground-truth images. The evaluation (Figure 6 in the paper) considers 10 different aspects, which are further categorized into four evaluation metrics using the GPT-4 Vision API.
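As a rough illustration of the rendering step, the sketch below writes predicted HTML to a file and captures a screenshot with headless Chrome through Selenium WebDriver. The browser choice, window size, and file names are assumptions for illustration; the benchmark's actual rendering configuration may differ.
```python
import pathlib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def render_html_to_image(html_code: str, out_path: str = "prediction.png") -> str:
    """Render predicted HTML to a screenshot for comparison with the ground-truth image."""
    html_file = pathlib.Path("prediction.html").resolve()
    html_file.write_text(html_code, encoding="utf-8")

    options = Options()
    options.add_argument("--headless=new")          # run Chrome without a display
    options.add_argument("--window-size=1280,720")  # illustrative viewport size
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(html_file.as_uri())
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
    return out_path

render_html_to_image("<html><body><h1>Predicted page</h1></body></html>")
```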
### Results
| LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | VSA ↑ | CAD ↑ | TCC ↑ | UII ↑ | Overall ↑ |
|------------------------|------|-----|------------------|------------------|--------|--------|--------|--------|------------|
| **CrystalChat-7B** | | | | | 4.714 | 4.572 | 4.865 | 5.147 | 4.825 |
| | ✓ | | | | 7.900 | 8.001 | 8.204 | 8.215 | 8.080 |
| | ✓ | ✓ | | | 7.900 | 8.001 | 8.204 | 8.215 | 8.080 |
| | ✓ | ✓ | ✓ | ✓ | **8.384** | **8.287** | **8.417** | **8.488** | **8.394** |
| **Vicuna-7B** | | | | | 3.042 | 3.250 | 3.333 | 3.167 | 3.198 |
| | ✓ | | | | 6.871 | 6.660 | 6.589 | 6.897 | 6.754 |
| | | ✓ | | | 3.898 | 3.489 | 3.340 | 3.651 | 3.595 |
| | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
| **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
**Table 2:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.
### About CrystalChat-7B-Web2Code:
* 7 billion parameter LLM
* CLIP ViT-L/14-336px vision encoder
* Languages: English
* Models Released: CrystalChat-7B-Web2Code
* Trained in 2 stages
* License: MIT
## General Evaluations
**General evaluation metrics for MLLMs.** MME serves as an extensive evaluation benchmark, assessing the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. We also evaluate our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models' robustness against hallucination through POPE.
| LLM Backbone | MME-P | MME-C | POPE | SciQA | TextVQA |
|-----------------------------------|---------|--------|-------|--------|---------|
| CrystalCoder-7B | 1359.83 | 238.92 | 86.182 | 64.15 | 50.39 |
| CrystalChat-7B | 1456.53 | **308.21** | 86.96 | 67.77 | **57.84** |
| Vicuna-7B | **1481.12** | 302.85 | **87.174** | **67.97** | 56.49 |
**Table 3:** Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the general domain data (i.e., LLaVA).
## Data and Training Details
### Pretrain Data
LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
### Finetune Data
The finetuning data contains the following:
#### LLaVa Finetuning Data
The dataset chosen was created for LLaVA-1.5 from an academic-task-oriented VQA data mixture together with data from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data. It is designed for visual instruction tuning and aims to build large multimodal models with capabilities akin to GPT-4 in both vision and language.
<!-- The full data sequence can be found [here](https://huggingface.co./datasets/liuhaotian/LLaVA-Instruct-150K) -->
| Data | Size | Response formatting prompts |
|---------------|------|--------------------------------------------------------------------------|
| LLaVA [36] | 158K | – |
| ShareGPT [46] | 40K | – |
| VQAv2 [19] | 83K | Answer the question using a single word or phrase. |
| GQA [21] | 72K | Answer the question using a single word or phrase. |
| OKVQA [41] | 9K | Answer the question using a single word or phrase. |
| OCRVQA [42] | 80K | Answer the question using a single word or phrase. |
| A-OKVQA [45] | 66K | Answer with the option's letter from the given choices directly. |
| TextCaps [47] | 22K | Provide a one-sentence caption for the provided image. |
| RefCOCO [24, 40] | 48K | Note: randomly choose between the two formats. Provide a short description for this region. |
| VG [25] | 86K | Provide the bounding box coordinate of the region this sentence describes. |
| **Total** | **665K** | |
**Table 4:** Instruction-following data mixture of LLaVA-1.5.
### Code Datasets
| Dataset | DWCG (ours) | DWCG<sub>R</sub> (ours) |
|---------|-------------|-------------------|
| **Instruction** | β | β |
| **Source** | Synthetic | Synthetic |
| **Size** | 60K | 824.7K |
| **Avg Length (tokens)** | 471.8Β±162.3 | 652.85Β±157.0 |
| **Avg Tag Count** | 28.1Β±10.6 | 35.3Β±9.0 |
| **Avg DOM Depth** | 5.3Β±1.0 | 6.5Β±1.0 |
| **Avg Unique Tags** | 13.6Β±2.7 | 13.5Β±2.5 |
**Table 5:** DWCG is a newly generated GPT-3.5-based dataset, while DWCG<sub>R</sub> is the refined dataset that utilizes the WebSight and Pix2Code datasets.
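For readers who want to reproduce statistics like those above on their own HTML, the sketch below computes tag count, unique tags, and DOM depth for a single document. BeautifulSoup is an assumption for illustration; the paper does not specify the tooling used, and the token-length statistic would additionally require a tokenizer.
```python
from bs4 import BeautifulSoup  # assumed HTML parser; any DOM-building parser works

def html_stats(html_code: str) -> dict:
    """Compute simple structural statistics for one HTML document."""
    soup = BeautifulSoup(html_code, "html.parser")
    tags = soup.find_all(True)  # every element node in the document

    def depth(tag) -> int:
        # Number of ancestors (document root included), so <html> sits at depth 1.
        return len(list(tag.parents))

    return {
        "tag_count": len(tags),
        "unique_tags": len({t.name for t in tags}),
        "dom_depth": max((depth(t) for t in tags), default=0),
    }

print(html_stats("<html><body><div><p>Hello</p></div></body></html>"))
# {'tag_count': 4, 'unique_tags': 4, 'dom_depth': 4}
```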
### Webpage Understanding Datasets
| Dataset | DWU | DWU<sub>R</sub> |
|---------------|---------|-----------------|
| **Instruction** | ✓ | ✓ |
| **Size** | 243.5K | 51.5K |
**Table 6:** Distribution of the DWU and DWU<sub>R</sub> datasets. Both datasets include high-quality question-answer pairs for webpage understanding.
## Examples
Example 1:
<center><img src="ori.png" alt="Original Input image"/></center>
**Image 1:** Original Input Image.
<center><img src="crystalchat.png" alt="CrsytalChat-7B model generated output"/></center>
**Image 2:** CrystalChat-7B-Web2Code model output.
Example 2:
<center><img src="hand_draw1_.png" alt="CrsytalChat-7B model generated output"/></center>
**Image 3:** CrystalChat-7B-Web2Code generated output for a hand-drawn webpage input.
## Loading Crystal
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    padding_side="right",
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
```
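A minimal usage sketch follows, assuming a plain-text prompt and the standard `generate` API. The exact prompt template and the image-preprocessing path for webpage screenshots are defined by the model's remote code and may differ from this example.
```python
import torch

# Illustrative text-only prompt; feeding webpage screenshots goes through the
# model's remote multimodal code, which is not shown here.
prompt = "Write the HTML for a simple landing page with a header, a hero section, and a footer."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```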
## LLM360
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.
Crystal-based models enable community-owned AGI by creating standards and tools that advance the bleeding edge of LLM capability and empower knowledge transfer, research, and development.
We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high-quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.
[Visit us](https://www.llm360.ai/)
## Citation
**BibTeX:**
```bibtex
@article{yun2024web2code,
title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
author={Yun, Sukmin and Lin, Haokun and Thushara, Rusiru and Bhat, Mohammad Qazim and Wang, Yongxin and Jiang, Zutao and Deng, Mingkai and Wang, Jinhong and Tao, Tianhua and Li, Junbo and others},
journal={arXiv preprint arXiv:2406.20098},
year={2024}
}
``` |