qazimbhat1 committed bfb2d4e (parent: d24fa21): Update README.md

Files changed (1): README.md (+27 −24)
README.md CHANGED
@@ -10,46 +10,43 @@ tags:
  - mllm
  ---

- # CrystalChat-7B-MLLM: a fully-reproducible vision large language model based on CrystalChat-7B LLM

- Crystal-based models mimic the training recipe used for Vicuna 7B in LLaVA multi-modal LLM (MLLM) training. Crystal-based models are entirely transparent, having open-sourced all materials, including code, data, model checkpoints, intermediate results, and more.

- | LLM Backbone | MME-P | MME-C | POPE | SciQA | TextVQA |
- |-----------------------------------|---------|--------|-------|--------|---------|
- | CrystalCoder-7B | 1359.83 | 238.92 | 86.18 | 64.15 | 50.39 |
- | CrystalChat-7B | 1456.53 | **308.21** | 86.96 | 67.77 | **57.84** |
- | Vicuna-7B | **1481.12** | 302.85 | **87.17** | **67.97** | 56.49 |
-
- *Table: Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on general-domain data (i.e., LLaVA).*
 
-
- ## About Crystal:
  * 7 billion parameter LLM
- * CLIP ViT-L/14
- * Tokens: ????
  * Languages: English
- * Models Released: ???? model
  * Trained in 2 stages
  * License: ?

- Crystal-based models were developed as a collaboration between [MBZUAI](https://mbzuai.ac.ae/institute-of-foundation-models/), [Petuum](https://www.petuum.com/), and [LLM360](https://www.llm360.ai/)????.
 
  ## Evaluation

  General evaluation metrics for MLLMs: MME serves as an extensive benchmark,
  aiming to assess the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. Additionally, we evaluate our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models' robustness against hallucination through POPE.

- <center><img src="k2_table_of_tables.png" alt="k2 big eval table"/></center>
-


- ## Datasets and Mix

  ### Pretrain Data
  LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.

- ### Finetune

  The dataset chosen was created by LLaVA with an academic-task-oriented VQA data mixture and data from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data. It is designed for visual instruction tuning and aims to develop large multimodal models with capabilities akin to GPT-4 in both vision and language.
 
@@ -69,12 +66,10 @@ The dataset chosen was created by LLaVA with academic-task-oriented VQA data mix
  | VG [25] | 86K | Provide the bounding box coordinate of the region this sentence describes. |
  | **Total** | **665K** | |

- **Table 7. Instruction-following Data Mixture of LLaVA-1.5.**


-
- # LLM360 Research Suite
-
  ## Stage 2 - Finetuning
  | Checkpoints | |
  | ----------- | ----------- |
@@ -87,9 +82,16 @@ The dataset chosen was created by LLaVA with academic-task-oriented VQA data mix
  | [CrystalChat](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-based-MLLM-7B-pretrain) |
  | [CrystalCoder](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-coder-7B-pretrain) |

  [to find all branches: git branch -a]

- # Loading Crystal

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -118,6 +120,7 @@ We believe in a future where artificial general intelligence (AGI) is created by

  [Visit us](https://www.llm360.ai/)

  ## Citation

  **BibTeX:**
 
  - mllm
  ---

+ # CrystalChat-7B-MLLM: a fully-reproducible vision language model based on CrystalChat-7B

+ ## Model Description

+ The CrystalChat-7B-based multi-modal large language model (MLLM) mimics the training recipe used for the Vicuna-7B-based [LLaVA-v1.5](https://huggingface.co/docs/transformers/main/model_doc/llava). CrystalChat-7B-MLLM models are entirely transparent: all materials are open-sourced, including code, data, model checkpoints, intermediate results, and more at [TODO: Add paper link]().

 
+ ### About CrystalChat-7B-MLLM:
  * 7 billion parameter LLM
+ * CLIP ViT-L/14-336px vision encoder
  * Languages: English
+ * Models Released: CrystalChat-7B-MLLM
  * Trained in 2 stages
  * License: ?

+ Crystal-based models were developed as a collaboration between [MBZUAI](https://mbzuai.ac.ae/institute-of-foundation-models/), [Petuum](https://www.petuum.com/), and [LLM360](https://www.llm360.ai/) TODO- check????.
+

  ## Evaluation

  General evaluation metrics for MLLMs: MME serves as an extensive benchmark,
  aiming to assess the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. Additionally, we evaluate our models on text-oriented visual question answering tasks using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Furthermore, we assess our models' robustness against hallucination through POPE.

+ | LLM Backbone | MME-P | MME-C | POPE | SciQA | TextVQA |
+ |-----------------------------------|---------|--------|-------|--------|---------|
+ | CrystalCoder-7B | 1359.83 | 238.92 | 86.18 | 64.15 | 50.39 |
+ | CrystalChat-7B | 1456.53 | **308.21** | 86.96 | 67.77 | **57.84** |
+ | Vicuna-7B | **1481.12** | 302.85 | **87.17** | **67.97** | 56.49 |

+ *Table 1: Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on general-domain data (i.e., LLaVA).*

 
+ ## Data and Training Details

  ### Pretrain Data
  LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
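To make that format concrete, here is a minimal sketch of how a single LCS-558K record could be inspected. The file name `blip_laion_cc_sbu_558k.json` and the exact field names are assumptions based on the public LLaVA release, not details stated in this card.

```python
import json

# Assumed file name from the public LLaVA release; adjust to your local copy.
with open("blip_laion_cc_sbu_558k.json") as f:
    records = json.load(f)

# Each record (assumed schema) pairs an image with a short synthesized dialogue:
# a randomly chosen instruction such as "Describe this image." and the raw
# caption serving as the answer.
sample = records[0]
print(sample["image"])                       # relative path to the image file
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"])  # "human" instruction, "gpt" caption
```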
 
+ ### Finetune Data
 
  The dataset chosen was created by LLaVA with an academic-task-oriented VQA data mixture and data from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data. It is designed for visual instruction tuning and aims to develop large multimodal models with capabilities akin to GPT-4 in both vision and language.

  | VG [25] | 86K | Provide the bounding box coordinate of the region this sentence describes. |
  | **Total** | **665K** | |

+ *Table 2. Instruction-following Data Mixture of LLaVA-1.5.*

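For readers who want to sanity-check the mixture against Table 2, the sketch below tallies records per source from the released LLaVA-1.5 mixture file. The file name `llava_v1_5_mix665k.json` and the convention that each record's `image` path begins with its source folder (e.g. `coco/`, `gqa/`, `vg/`) are assumptions about the public LLaVA release, not details stated in this card.

```python
import json
from collections import Counter

# Assumed file name from the public LLaVA-1.5 release.
with open("llava_v1_5_mix665k.json") as f:
    mixture = json.load(f)

# Group records by the top-level folder of their image path; text-only
# ShareGPT records carry no image and are counted separately.
counts = Counter(
    rec["image"].split("/")[0] if "image" in rec else "sharegpt (text-only)"
    for rec in mixture
)

for source, n in counts.most_common():
    print(f"{source:>24}: {n}")
print(f"{'total':>24}: {sum(counts.values())}")  # should be roughly 665K
```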
 
+ TODO: Check if we need to publish these 2

  ## Stage 2 - Finetuning
  | Checkpoints | |
  | ----------- | ----------- |

  | [CrystalChat](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-based-MLLM-7B-pretrain) |
  | [CrystalCoder](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-coder-7B-pretrain) |

+
  [to find all branches: git branch -a]

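As an alternative to cloning the repo and running `git branch -a`, the branches (and hence the available checkpoints) can be listed programmatically. This is a minimal sketch using `huggingface_hub.list_repo_refs`, assuming the repo id `qazimbhat1/Crystal-based-MLLM-7B` shown in the links above.

```python
from huggingface_hub import list_repo_refs

# List every branch of the model repo; each branch holds one checkpoint.
refs = list_repo_refs("qazimbhat1/Crystal-based-MLLM-7B")
for branch in refs.branches:
    print(branch.name)
```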
+ ## Examples
+
+ TODO: Add image as sample example
+ <center><img src="k2_table_of_tables.png" alt="k2 big eval table"/></center>
+
+
+ ## Loading Crystal
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
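  # Minimal assumed loading sketch (not the repo's exact code): the repo id and
  # branch name are taken from the checkpoint links above, and
  # trust_remote_code=True is assumed because Crystal-based models ship custom
  # modeling code.
  model = AutoModelForCausalLM.from_pretrained(
      "qazimbhat1/Crystal-based-MLLM-7B",
      revision="Crystal-based-MLLM-7B-pretrain",  # or any branch listed above
      trust_remote_code=True,
  )
  tokenizer = AutoTokenizer.from_pretrained(
      "qazimbhat1/Crystal-based-MLLM-7B",
      revision="Crystal-based-MLLM-7B-pretrain",
      trust_remote_code=True,
  )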
 
 
  [Visit us](https://www.llm360.ai/)

+
  ## Citation

  **BibTeX:**