Update README.md

README.md (CHANGED)
@@ -10,10 +10,7 @@ datasets:
 pipeline_tag: visual-question-answering
 ---
 
-#
-<p align="center">
-<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
-</p>
+# InternVL-Chat-V1-2
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
 
@@ -45,18 +42,6 @@ For better training reproducibility, we follow the minimalist design and data ef
 - Learnable Component: ViT + MLP + LLM
 - Data: A simplified, fully open-source dataset containing approximately 1.2 million samples.
 
-
-## Released Models
-
-| Model | Vision Foundation Model | Release Date | Note |
-| :-: | :-: | :-: | :- |
-| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | Supports 4K images and very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
-| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | More SFT data and stronger performance |
-| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | Scales the LLM up to 34B |
-| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | Supports Chinese and stronger OCR |
-
-
-
 ## Performance
 
 \* Proprietary Model
@@ -75,7 +60,6 @@ For better training reproducibility, we follow the minimalist design and data ef
 - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
 - Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.
 
-
 ## Training Details
 
 ### Data Preparation
@@ -84,7 +68,6 @@ Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train Intern
 
 For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
 
-
 ### Training (Supervised Finetuning)
 
 We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
@@ -97,9 +80,6 @@ The hyperparameters used for finetuning are listed in the following table.
 | ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
 | InternVL-Chat-V1-2 | 40B (full model) | 512               | 1e-5          | 1      | 2048       | 0.05         |
 
-
-
-
 ## Model Usage
 
 We provide example code for running InternVL-Chat-V1-2 with `transformers`.
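For quick reference, the sketch below shows one way to load and query the model with `transformers`; it is not the exact snippet shipped in the README. The `model.chat(tokenizer, pixel_values, question, generation_config)` call and the `CLIPImageProcessor` preprocessing are assumed to follow the pattern used by other InternVL chat releases, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2"

# The modeling code lives in the model repository, so trust_remote_code is required.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Resize to the model's 448x448 input and build pixel values.
# "./examples/image.jpg" is a placeholder path.
image = Image.open("./examples/image.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Ask a single question about the image (assumed chat interface).
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

Note that a 40B model in bfloat16 occupies roughly 80 GB of weights alone, so in practice you may need to shard it across several GPUs (for example with `device_map="auto"`) rather than calling `.cuda()` on a single device.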
@@ -178,12 +158,3 @@ If you find this project useful in your research, please consider citing:
 ## License
 
 This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
-
-Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
-
-## Acknowledgement
-
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
-
-## Contributors
-Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai