victormiller committed on
Commit
9209438
•
1 Parent(s): 77ecb22

Update README.md

  1. README.md +17 -18
README.md CHANGED
@@ -31,6 +31,22 @@ The Web2Code instruction tuning dataset was released in [Web2Code: A Large-scale
31
 
32
  ## Evaluations
33
 
34
  ## Webpage Code Generation Benchmark (WCGB)
35
 
36
  Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage’s fidelity at the image level. We convert the predicted HTML codes back into images using Selenium WebDriver to allow a direct visual comparison with the ground truth images. The evaluation, depicted on the left side of Figure 6, considers 10 different aspects, which are further categorized into four evaluation matrices using the GPT-4 Vision API.
@@ -48,24 +64,7 @@ Utilizing the same images as the WUB, this benchmark evaluates a multimodal mode
48
  | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
49
  | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
50
 
51
- **Table 1:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.
52
-
53
- ## Webpage Understanding Benchmark (WUB)
54
- ### Results
55
- | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
56
- |------------------------|------|-----|------------------|------------------|--------------|
57
- | **CrystalChat-7B** | | | | | 73.94 |
58
- | | ✓ | ✓ | | | 73.48 |
59
- | | ✓ | ✓ | ✓ | ✓ | **74.14** |
60
- | **Vicuna-7B** | | | | | 71.12 |
61
- | | ✓ | | | | 68.11 |
62
- | | | ✓ | | | 70.82 |
63
- | | ✓ | ✓ | ✓ | ✓ | **71.23** |
64
- | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |
65
-
66
- **Table 2:** The accuracy of webpage understanding under various data configurations and LLM backbones.
67
- All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configurations by default.
68
-
69
 
70
  ### About CrystalChat-7B-Web2Code:
71
  * 7 billion parameter LLM
 
31
 
32
  ## Evaluations
33
 
34
+ ## Webpage Understanding Benchmark (WUB)
35
+ ### Results
36
+ | LLM Backbone | DWCG | DWU | DWCG<sub>R</sub> | DWU<sub>R</sub> | Accuracy (%) |
37
+ |------------------------|------|-----|------------------|------------------|--------------|
38
+ | **CrystalChat-7B** | | | | | 73.94 |
39
+ | | ✓ | ✓ | | | 73.48 |
40
+ | | ✓ | ✓ | ✓ | ✓ | **74.14** |
41
+ | **Vicuna-7B** | | | | | 71.12 |
42
+ | | ✓ | | | | 68.11 |
43
+ | | | ✓ | | | 70.82 |
44
+ | | ✓ | ✓ | ✓ | ✓ | **71.23** |
45
+ | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **74.84** |
46
+
47
+ **Table 1:** The accuracy of webpage understanding under various data configurations and LLM backbones.
48
+ All models are instruction-tuned and evaluated on our WUB benchmark. We note that the general domain data (i.e., LLaVA) is included in all data configurations by default.
49
+
50
  ## Webpage Code Generation Benchmark (WCGB)
51
 
52
  Utilizing the same images as the WUB, this benchmark evaluates a multimodal model tasked with generating HTML code from webpage images based on specific instructions. Unlike traditional code-level evaluations, this benchmark assesses the generated webpage’s fidelity at the image level. We convert the predicted HTML codes back into images using Selenium WebDriver to allow a direct visual comparison with the ground truth images. The evaluation, depicted on the left side of Figure 6, considers 10 different aspects, which are further categorized into four evaluation matrices using the GPT-4 Vision API.
 
64
  | | ✓ | ✓ | ✓ | ✓ | **7.876** | **7.687** | **7.267** | **7.563** | **7.598** |
65
  | **Llama3-8B** | ✓ | ✓ | ✓ | ✓ | **8.522** | **8.564** | **8.421** | **8.611** | **8.530** |
66
 
67
+ **Table 2:** The performance of different LLM backbones under various data configurations on our Webpage Code Generation Benchmark (WCGB). "VSA" denotes Visual Structure and Alignment, "CAD" represents Color and Aesthetic Design, "TCC" represents Textual and Content Consistency, and "UII" denotes User Interface and Interactivity.
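
Reviewer note: the table header falls outside this hunk, so the column order below is an assumption taken from the caption (VSA, CAD, TCC, UII, then a final summary column). Under that assumption, a minimal sketch for checking whether the final column in the two visible WCGB rows is consistent with the plain mean of the four category scores could look like this. The names `rows` and `category_mean` are hypothetical and are not part of the README or the benchmark code.

```python
from statistics import mean

# Scores copied from the two WCGB rows visible in this diff.
# Assumed column order: VSA, CAD, TCC, UII, final summary column.
rows = {
    "row without a backbone label in this hunk": (7.876, 7.687, 7.267, 7.563, 7.598),
    "Llama3-8B": (8.522, 8.564, 8.421, 8.611, 8.530),
}


def category_mean(scores):
    """Return the arithmetic mean of the first four (category) scores."""
    return mean(scores[:4])


for name, scores in rows.items():
    avg = category_mean(scores)
    reported = scores[4]
    # Reported values carry three decimals, so allow a rounding tolerance.
    consistent = abs(avg - reported) < 0.001
    print(f"{name}: mean={avg:.4f} reported={reported:.3f} consistent={consistent}")
```

This only checks arithmetic consistency for the rows shown here; it does not confirm how the benchmark itself aggregates its scores.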
68
 
69
  ### About CrystalChat-7B-Web2Code:
70
  * 7 billion parameter LLM