victormiller committed on
Commit 9209438 • 1 Parent(s): 77ecb22
Update README.md
README.md CHANGED
@@ -31,6 +31,22 @@ The Web2Code instruction tuning dataset was released in [Web2Code: A Large-scale
31 |
32 | ## Evaluations
33 |
34 | ## Webpage Code Generation Benchmark (WCGB)
35 |
36 | Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage’s fidelity at the image level. We convert the predicted HTML codes back into images using Selenium WebDriver to allow a direct visual comparison with the ground truth images. The evaluation, depicted on the left side of Figure 6, considers 10 different aspects, which are further categorized into four evaluation matrices using the GPT-4 Vision API.
@@ -48,24 +64,7 @@ Utilizing the same images as the WUB, this benchmark evaluates a multimodal mode
48 | | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
49 | | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
50 |
51 | - **Table
52 | -
53 | - ## Webpage Understanding Benchmark (WUB)
54 | - ### Results
55 | - | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
56 | - |------------------------|------|-----|------------------|------------------|--------------|
57 | - | **CrystalChat-7B** | | | | | 73.94 |
58 | - | | ✓ | ✓ | | | 73.48 |
59 | - | | ✓ | ✓ | ✓ | ✓ | **74.14** |
60 | - | **Vicuna-7B** | | | | | 71.12 |
61 | - | | ✓ | | | | 68.11 |
62 | - | | | ✓ | | | 70.82 |
63 | - | | ✓ | ✓ | ✓ | ✓ | **71.23** |
64 | - | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |
65 | -
66 | - **Table 2:** The accuracy of webpage understanding under various data configurations and LLM backbones.
67 | - All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configuration as default.
68 | -
69 |
70 | ### About CrystalChat-7B-Web2Code:
71 | * 7 billion parameter LLM
31 |
32 | ## Evaluations
33 |
34 | + ## Webpage Understanding Benchmark (WUB)
35 | + ### Results
36 | + | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
37 | + |------------------------|------|-----|------------------|------------------|--------------|
38 | + | **CrystalChat-7B** | | | | | 73.94 |
39 | + | | ✓ | ✓ | | | 73.48 |
40 | + | | ✓ | ✓ | ✓ | ✓ | **74.14** |
41 | + | **Vicuna-7B** | | | | | 71.12 |
42 | + | | ✓ | | | | 68.11 |
43 | + | | | ✓ | | | 70.82 |
44 | + | | ✓ | ✓ | ✓ | ✓ | **71.23** |
45 | + | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |
46 | +
47 | + **Table 1:** The accuracy of webpage understanding under various data configurations and LLM backbones.
48 | + All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configuration as default.
49 | +
50 | ## Webpage Code Generation Benchmark (WCGB)
51 |
52 | Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage’s fidelity at the image level. We convert the predicted HTML codes back into images using Selenium WebDriver to allow a direct visual comparison with the ground truth images. The evaluation, depicted on the left side of Figure 6, considers 10 different aspects, which are further categorized into four evaluation matrices using the GPT-4 Vision API.
64 | | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
65 | | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
66 |
67 | + **Table 2:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.
68 |
69 | ### About CrystalChat-7B-Web2Code:
70 | * 7 billion parameter LLM
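
The WCGB description in this diff says that predicted HTML is rendered back into images with Selenium WebDriver before the visual comparison. A minimal sketch of that rendering step is shown here, assuming headless Chrome with a working local driver. The function names, the `prediction.html` file name, and the window size are illustrative assumptions and are not taken from the Web2Code code base.

```python
import tempfile
from pathlib import Path


def html_to_file_uri(html: str, directory: str) -> str:
    """Write predicted HTML to disk and return a file:// URI a browser can load."""
    page = Path(directory) / "prediction.html"  # hypothetical file name
    page.write_text(html, encoding="utf-8")
    return page.resolve().as_uri()


def render_screenshot(html: str, png_path: str,
                      width: int = 1280, height: int = 1024) -> None:
    """Render HTML in headless Chrome and save a PNG of the visible viewport."""
    # Imported lazily so html_to_file_uri works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument(f"--window-size={width},{height}")  # assumed size
    driver = webdriver.Chrome(options=options)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            driver.get(html_to_file_uri(html, tmp))
            driver.save_screenshot(png_path)
    finally:
        # Always release the browser, even if rendering fails.
        driver.quit()
```

Calling `render_screenshot(predicted_html, "prediction.png")` would produce an image that can be placed next to the ground truth screenshot. How the two images are then scored (the GPT-4 Vision step) is outside this sketch.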