- mllm
---

# CrystalChat-7B-MLLM: a fully-reproducible vision language model based on CrystalChat-7B

## Model Description

CrystalChat-7B-MLLM is a multi-modal large language model (MLLM) built on the CrystalChat-7B backbone. It mimics the training recipe used for the Vicuna-7B-based [LLaVA-v1.5](https://huggingface.co/docs/transformers/main/model_doc/llava). CrystalChat-7B-MLLM is fully transparent: all materials, including code, data, model checkpoints, intermediate results, and more, are open-sourced at [TODO: Add paper link]().
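The LLaVA-v1.5 recipe referenced above couples a frozen CLIP ViT-L/14-336px image encoder to the language model through a small projector whose outputs are consumed as visual tokens. The sketch below illustrates only that projection step; the two-layer MLP design, module names, and hidden sizes are assumptions based on LLaVA-v1.5, not the exact CrystalChat-7B-MLLM implementation.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Illustrative LLaVA-1.5-style projector (assumed two-layer MLP with GELU)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(patch_features)

# CLIP ViT-L/14 at 336px yields 24 x 24 = 576 patch features of width 1024
# (assumed shapes for illustration); the projected tokens are consumed as
# visual tokens alongside the text embeddings of the 7B language model.
patches = torch.randn(1, 576, 1024)
visual_tokens = VisionToLLMProjector()(patches)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```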

### About CrystalChat-7B-MLLM:

* 7 billion parameter LLM
* CLIP ViT-L/14-336px vision encoder
* Languages: English
* Models Released: CrystalChat-7B-MLLM
* Trained in 2 stages
* License: ?

Crystal-based models were developed as a collaboration between [MBZUAI](https://mbzuai.ac.ae/institute-of-foundation-models/), [Petuum](https://www.petuum.com/), and [LLM360](https://www.llm360.ai/). (TODO: check)

## Evaluation

We report general evaluation metrics for MLLMs. MME serves as an extensive benchmark that assesses the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. We also evaluate our models on text-oriented visual question answering using a diverse set of benchmarks, including ScienceQA and TextVQA, and we assess their resistance to hallucination with POPE.

| LLM Backbone    | MME-P       | MME-C      | POPE      | SciQA     | TextVQA   |
|-----------------|-------------|------------|-----------|-----------|-----------|
| CrystalCoder-7B | 1359.83     | 238.92     | 86.18     | 64.15     | 50.39     |
| CrystalChat-7B  | 1456.53     | **308.21** | 86.96     | 67.77     | **57.84** |
| Vicuna-7B       | **1481.12** | 302.85     | **87.17** | **67.97** | 56.49     |

*Table 1: Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on general-domain data (i.e., LLaVA).*
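For a quick programmatic read of Table 1, the snippet below (plain Python, scores copied verbatim from the table) prints the best backbone per benchmark and a combined MME score (perception plus cognition) as a single-number summary; the combined score is our convenience metric, not one reported above.

```python
# Scores copied from Table 1.
scores = {
    "CrystalCoder-7B": {"MME-P": 1359.83, "MME-C": 238.92, "POPE": 86.18, "SciQA": 64.15, "TextVQA": 50.39},
    "CrystalChat-7B":  {"MME-P": 1456.53, "MME-C": 308.21, "POPE": 86.96, "SciQA": 67.77, "TextVQA": 57.84},
    "Vicuna-7B":       {"MME-P": 1481.12, "MME-C": 302.85, "POPE": 87.17, "SciQA": 67.97, "TextVQA": 56.49},
}

# Best backbone per benchmark.
for metric in ["MME-P", "MME-C", "POPE", "SciQA", "TextVQA"]:
    best = max(scores, key=lambda name: scores[name][metric])
    print(f"{metric}: best backbone is {best} ({scores[best][metric]})")

# Combined MME = perception + cognition.
for name, s in scores.items():
    print(f"{name}: MME total = {s['MME-P'] + s['MME-C']:.2f}")
```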

## Data and Training Details

### Pretrain Data

LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets with a more balanced distribution of concept coverage. It contains multimodal, synthesized conversations generated from image-caption pairs by pairing each image with a randomly selected instruction such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
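A minimal sketch of the construction described above, turning one image-caption pair into a single-turn conversation, is shown below. The record layout (`<image>` placeholder, `human`/`gpt` turns) and the instruction list are illustrative assumptions rather than the exact LCS-558K format.

```python
import random

# A few instruction templates of the kind mentioned above (illustrative).
INSTRUCTIONS = [
    "Describe this image.",
    "What is shown in this picture?",
    "Give a short description of the photo.",
]

def caption_pair_to_conversation(image_path: str, caption: str) -> dict:
    """Turn one image-caption pair into a single-turn multimodal conversation."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{random.choice(INSTRUCTIONS)}"},
            # The raw caption serves as the default answer, as described above.
            {"from": "gpt", "value": caption},
        ],
    }

print(caption_pair_to_conversation("00001.jpg", "A dog running on the beach."))
```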

### Finetune Data

The finetuning dataset was created by LLaVA from an academic-task-oriented VQA data mixture combined with data from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following examples; it is designed for visual instruction tuning and aims to develop large multimodal models with capabilities akin to GPT-4 in both vision and language. Table 2 lists the mixture, and a short inspection sketch follows it.

| VG [25] | 86K | Provide the bounding box coordinate of the region this sentence describes. |
| **Total** | **665K** | |

*Table 2. Instruction-following Data Mixture of LLaVA-1.5.*
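As a minimal sketch (assuming the mixture is distributed as a single LLaVA-style JSON file; the filename below is a placeholder), the records can be inspected to separate image-grounded samples from the text-only ShareGPT conversations:

```python
import json

# Placeholder filename; substitute the actual finetuning mixture file.
with open("llava_finetune_mixture.json") as f:
    records = json.load(f)

# LLaVA-style records carry an "image" field when a sample is image-grounded;
# text-only conversations (e.g. ShareGPT) are assumed to omit it.
with_image = sum(1 for record in records if "image" in record)
print("total records:", len(records))
print("image-grounded:", with_image, "| text-only:", len(records) - with_image)
```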

TODO: Check if we need to publish these 2

## Stage 2 - Finetuning

| Checkpoints |
| ----------- |
| [CrystalChat](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-based-MLLM-7B-pretrain) |
| [CrystalCoder](https://huggingface.co/qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-coder-7B-pretrain) |

To find all branches of the repository, run `git branch -a` in a local clone.
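If you prefer to stay in Python, the same information can be fetched with `huggingface_hub`; the sketch below assumes the repository id from the checkpoint links above, lists every branch, and downloads one of them.

```python
from huggingface_hub import list_repo_refs, snapshot_download

repo_id = "qazimbhat1/Crystal-based-MLLM-7B"

# Equivalent of `git branch -a`: list every branch of the model repository.
refs = list_repo_refs(repo_id)
for branch in refs.branches:
    print(branch.name)

# Download the files of one branch (checkpoint), e.g. the CrystalChat pretrain branch.
local_dir = snapshot_download(repo_id, revision="Crystal-based-MLLM-7B-pretrain")
print("downloaded to:", local_dir)
```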

## Examples

TODO: Add image as sample example

<center><img src="k2_table_of_tables.png" alt="k2 big eval table"/></center>

## Loading Crystal

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
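A minimal, self-contained loading sketch is given below. The repository id, `revision`, and `trust_remote_code=True` are assumptions (custom Crystal model code is presumed to live in the repo), and image handling is omitted because the exact multimodal API is not documented above; adjust to the official instructions once available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id (taken from the checkpoint links above); set `revision`
# to the branch you want, e.g. the CrystalChat-based checkpoint.
repo_id = "qazimbhat1/Crystal-based-MLLM-7B"
revision = "Crystal-based-MLLM-7B-pretrain"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision=revision,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Text-only generation; image inputs are model-specific and omitted here.
prompt = "Describe what a vision-language model does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```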

[Visit us](https://www.llm360.ai/)

## Citation

**BibTeX:**