---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- nlp
- llm
- mllm
---

# CrystalChat-7B-MLLM: a fully reproducible vision large language model based on the CrystalChat-7B LLM

Crystal-based models mimic the training recipe used for Vicuna-7B in the LLaVA multimodal large language model (MLLM) setup. They are entirely transparent: all materials are open-sourced, including code, data, model checkpoints, intermediate results, and more.

| LLM Backbone                      | MME-P   | MME-C  | POPE  | SciQA  | TextVQA |
|-----------------------------------|---------|--------|-------|--------|---------|
| CrystalCoder-7B | 1359.83 | 238.92 | 86.18 | 64.15 | 50.39   |
| CrystalChat-7B  | 1456.53 | **308.21** | 86.96 | 67.77 | **57.84** |
| Vicuna-7B | **1481.12** | 302.85 | **87.17** | **67.97** | 56.49   |

*Table: Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the same general-domain data (i.e., LLaVA).*


## About Crystal
* LLM backbone: 7 billion parameters
* Vision encoder: CLIP ViT-L/14
* Tokens: ????
* Languages: English
* Models Released: ???? model
* Trained in 2 stages
* License: Apache 2.0

Crystal-based models were developed as a collaboration between [MBZUAI](https://mbzuai.ac.ae/institute-of-foundation-models/), [Petuum](https://www.petuum.com/), and [LLM360](https://www.llm360.ai/).

## Evaluation

General evaluation metrics for MLLMs: MME serves as an extensive benchmark that assesses the perceptual and cognitive capabilities of MLLMs across 14 sub-tasks. We also evaluate our models on text-oriented visual question answering using a diverse set of benchmark datasets, including ScienceQA and TextVQA. Finally, we assess our models' robustness to hallucination with POPE.
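As background, POPE frames hallucination evaluation as binary yes/no questions about whether an object is present in an image. The snippet below is a minimal sketch of how such yes/no answers are typically scored; it is illustrative only and not the evaluation harness behind the numbers reported here.

```python
from typing import Dict, List

def pope_scores(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Score yes/no answers POPE-style; "yes" is the positive class."""
    to_bool = lambda s: s.strip().lower().startswith("yes")
    preds = [to_bool(p) for p in predictions]
    golds = [to_bool(g) for g in labels]

    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    tn = sum(not p and not g for p, g in zip(preds, golds))

    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / max(len(golds), 1),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-9),
    }
```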


<center><img src="k2_table_of_tables.png" alt="k2 big eval table"/></center>



## Datasets and Mix

### Pretrain Data
LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.
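For illustration, a single record in this pretraining set looks roughly like the following. The field names follow the LLaVA data format and the values are invented here, so treat this as an assumed sketch of the structure described above:

```python
# Hypothetical LCS-558K-style record (field names follow LLaVA conventions;
# the id, path, and caption are invented for illustration).
sample = {
    "id": "004539375",
    "image": "00453/004539375.jpg",
    "conversations": [
        # A randomly selected instruction such as "Describe this image."
        {"from": "human", "value": "<image>\nDescribe this image."},
        # The raw caption of the source image-caption pair is the default answer.
        {"from": "gpt", "value": "A brown dog running across a grassy field."},
    ],
}
```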

### Finetune

The fine-tuning dataset follows LLaVA, combining an academic-task-oriented VQA data mixture with conversations from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data, designed for visual instruction tuning and aimed at building large multimodal models with capabilities akin to GPT-4 in both vision and language. The table below lists the mixture; a short sketch after the table illustrates how the response-formatting prompts are applied.

<!-- The full data sequence can be found [here](https://huggingface.co./datasets/liuhaotian/LLaVA-Instruct-150K) -->

| Data          | Size | Response formatting prompts                                              |
|---------------|------|--------------------------------------------------------------------------|
| LLaVA [36]    | 158K | –                                                                        |
| ShareGPT [46] | 40K  | –                                                                        |
| VQAv2 [19]    | 83K  | Answer the question using a single word or phrase.                       |
| GQA [21]      | 72K  | Answer the question using a single word or phrase.                       |
| OKVQA [41]    | 9K   | Answer the question using a single word or phrase.                       |
| OCRVQA [42]   | 80K  | Answer the question using a single word or phrase.                       |
| A-OKVQA [45]  | 66K  | Answer with the option’s letter from the given choices directly.         |
| TextCaps [47] | 22K  | Provide a one-sentence caption for the provided image.                   |
| RefCOCO [24, 40] | 48K  | Note: randomly choose between the two formats. Provide a short description for this region. |
| VG [25]       | 86K  | Provide the bounding box coordinate of the region this sentence describes. |
| **Total**     | **665K** |                                                                      |

**Table: Instruction-following data mixture of LLaVA-1.5 (Table 7 in the LLaVA-1.5 paper; bracketed citation numbers refer to that paper's reference list).**
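The response-formatting prompts in the table are appended to each question so the model learns to emit short-form answers for the VQA-style sources. The helper below is an illustrative sketch of that convention (the mapping and function name are assumptions, not code from this repository):

```python
# Illustrative only: mirrors (a subset of) the response-formatting prompts
# listed in the table above.
FORMATTING_PROMPTS = {
    "vqav2": "Answer the question using a single word or phrase.",
    "gqa": "Answer the question using a single word or phrase.",
    "okvqa": "Answer the question using a single word or phrase.",
    "ocrvqa": "Answer the question using a single word or phrase.",
    "a-okvqa": "Answer with the option's letter from the given choices directly.",
    "textcaps": "Provide a one-sentence caption for the provided image.",
}

def format_question(source: str, question: str) -> str:
    """Append the dataset-specific response-formatting prompt, if any."""
    suffix = FORMATTING_PROMPTS.get(source.lower())
    return f"{question}\n{suffix}" if suffix else question

# format_question("vqav2", "What color is the bus?")
# -> "What color is the bus?\nAnswer the question using a single word or phrase."
```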



# LLM360 Research Suite

## Stage 2 - Finetuning
| Checkpoints |
| ----------- |
| [CrystalChat](https://huggingface.co./qazimbhat1/my-model-repo3/tree/main) |
| [CrystalCoder](https://huggingface.co./qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-coder-7B) |

## Stage 1 - Pretraining
| Checkpoints |
| ----------- |
| [CrystalChat](https://huggingface.co./qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-based-MLLM-7B-pretrain) |
| [CrystalCoder](https://huggingface.co./qazimbhat1/Crystal-based-MLLM-7B/tree/Crystal-coder-7B-pretrain) |

To list all checkpoint branches in a local clone, run `git branch -a`.
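Because each stage's checkpoints live on separate branches, they can also be fetched programmatically with `huggingface_hub` by passing the branch name as `revision`. The repo ID and branch below are taken from the checkpoint links above; this is a sketch, not the only way to obtain the weights.

```python
from huggingface_hub import snapshot_download

# Download the Stage 1 (pretraining) CrystalCoder-backbone checkpoint;
# the branch name from the table above is passed as the revision.
local_dir = snapshot_download(
    repo_id="qazimbhat1/Crystal-based-MLLM-7B",
    revision="Crystal-coder-7B-pretrain",
)
print(local_dir)
```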

# Loading Crystal
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required: the checkpoint ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    padding_side="right",
    trust_remote_code=True,
)

# Load in half precision; device_map="auto" (via accelerate) places the
# weights across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
```
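Continuing from the snippet above, here is a minimal text-only generation sketch using the standard `generate` API. The chat/prompt template and the image-input interface are defined by the model's remote code, so treat the plain-string prompt here as an assumption rather than the canonical usage.

```python
# Continues from the loading snippet above (model, tokenizer, torch in scope).
prompt = "What is a multimodal large language model?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding for a deterministic sketch
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```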



## LLM360
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.

Crystal-based models enable community-owned AGI by creating standards and tools to advance the bleeding edge of LLM capability and to empower knowledge transfer, research, and development.

We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high-quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.

[Visit us](https://www.llm360.ai/)

## Citation

**BibTeX:**

```bibtex
@article{
      title={}, 
      author={},
      year={},
}
```