Commit cd9865b by steinhaug • 1 Parent(s): e39b60c

Create MiniGPT-4 .md

# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), [Xiang Li](https://xiangli.ac.cn), and [Mohamed Elhoseiny](https://www.mohamed-elhoseiny.com/). *Equal Contribution

**King Abdullah University of Science and Technology**

## Online Demo

Click the image to chat with MiniGPT-4 about your images.
[![demo](figs/online_demo.png)](https://minigpt-4.github.io)

## Examples
|   |   |
:-------------------------:|:-------------------------:
![find wild](figs/examples/wop_2.png) | ![write story](figs/examples/ad_2.png)
![solve problem](figs/examples/fix_1.png) | ![write Poem](figs/examples/rhyme_1.png)

More examples can be found on the [project page](https://minigpt-4.github.io).

## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with the frozen LLM Vicuna using just one projection layer.
- We train MiniGPT-4 in two stages. The first, traditional pretraining stage is trained on roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna is able to understand the image, but its generation ability is heavily degraded.
- To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs using the model itself and ChatGPT together. Based on this, we then create a small (3,500 pairs in total) yet high-quality dataset.
- The second finetuning stage trains on this dataset with a conversation template to significantly improve generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes on a single A100.
- MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.

![overview](figs/overview.png)

## Getting Started
### Installation

**1. Prepare the code and the environment**

Clone our repository, create a Python environment, and activate it via the following commands:

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
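
As an optional sanity check, you can confirm that the environment is active and that PyTorch sees a CUDA device before continuing (this assumes the conda environment installs PyTorch with CUDA support, which the demo and training steps below rely on):

```bash
# Optional sanity check: the minigpt4 env should provide CUDA-enabled PyTorch.
# If this prints `False`, the demo and training commands below will not find a GPU.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```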

**2. Prepare the pretrained Vicuna weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to our instructions [here](PrepareVicuna.md)
to prepare the Vicuna weights.
The final weights should sit in a single folder with the following structure:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
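
For example, you can locate and edit that entry from the shell. The key name used below (`llama_model`) is an assumption about the config layout, so verify it against Line 16 of the file before editing:

```bash
# Hypothetical sketch: find the weight-path entry in the model config
# (assumed to be a key such as `llama_model`) and point it at your local folder.
grep -n "llama_model" minigpt4/configs/models/minigpt4.yaml
# Then edit that line to read something like:
#   llama_model: "/path/to/vicuna_weights"
```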

**3. Prepare the pretrained MiniGPT-4 checkpoint**

To play with our pretrained model, download the pretrained checkpoint
[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
Then, set the path to the pretrained checkpoint in the evaluation config file
[eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 11.
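
On a headless server, you can fetch the checkpoint with `gdown` instead of a browser. The output filename below is only a placeholder; the file ID comes from the Google Drive link above:

```bash
# Assumes `pip install gdown`; the ID is taken from the Drive link above.
gdown 1a4zLvaiDBr-36pasffmgpvH5P7CKmpze -O minigpt4_pretrained.pth
# Then point the checkpoint path in eval_configs/minigpt4_eval.yaml (Line 11) at this file.
```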

### Launching Demo Locally

Try out our demo [demo.py](demo.py) on your local machine by running

```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```

Here, we load Vicuna in 8-bit by default to save GPU memory, and the default beam search width is 1.
Under this setting, the demo costs about 23 GB of GPU memory.
If you have a more powerful GPU with more memory, you can run the model
in 16-bit by setting `low_resource` to `False` in the config file
[minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml) and use a larger beam search width.
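
A couple of common launch variations, using only the flags and config options described above:

```bash
# Run the demo on the second GPU of a multi-GPU machine.
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 1

# For 16-bit inference, first set `low_resource: False` in
# eval_configs/minigpt4_eval.yaml, then launch as usual; expect noticeably
# more GPU memory than the ~23 GB used by the 8-bit default.
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```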

### Training
The training of MiniGPT-4 contains two alignment stages.

**1. First pretraining stage**

In the first pretraining stage, the model is trained on image-text pairs from the LAION and CC datasets
to align the vision and language models. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped so that the language model can understand them.
To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml).

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```
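
For instance, on the 4 A100 setup described above, `NUM_GPU` is simply 4:

```bash
# Example: stage-1 pretraining on a single node with 4 GPUs.
torchrun --nproc-per-node 4 train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```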

A MiniGPT-4 checkpoint trained with only the first stage can be downloaded
[here](https://drive.google.com/file/d/1u9FRRBB3VovP1HxCAlpD9Lw4t4P6-Yq8/view?usp=share_link).
Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.


**2. Second finetuning stage**

In the second stage, we use a small, high-quality image-text dataset that we created ourselves
and convert it to a conversation format to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
Then, run the following command. In our experiments, we use 1 A100.

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```
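
On the single-A100 setup mentioned above, this becomes:

```bash
# Example: stage-2 finetuning on one GPU, after setting the stage-1 checkpoint
# path and the output path in train_configs/minigpt4_stage2_finetune.yaml.
torchrun --nproc-per-node 1 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```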

After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly way.


## Acknowledgement

+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you don't know it already!
+ [Lavis](https://github.com/salesforce/LAVIS) This repository is built upon Lavis!
+ [Vicuna](https://github.com/lm-sys/FastChat) The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!


If you're using MiniGPT-4 in your research or applications, please cite it using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
}
```


## License
This repository is under the [BSD 3-Clause License](LICENSE.md).
Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS), which is also under the BSD 3-Clause License [here](LICENSE_Lavis.md).