czczup committed on
Commit 450aed5 · verified · 1 Parent(s): 8314959

Update README.md

Files changed (1):
  1. README.md +1 -30
README.md CHANGED
@@ -10,10 +10,7 @@ datasets:
 pipeline_tag: visual-question-answering
 ---
 
-# Model Card for InternVL-Chat-V1-2
-<p align="center">
-<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
-</p>
+# InternVL-Chat-V1-2
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
 
@@ -45,18 +42,6 @@ For better training reproducibility, we follow the minimalist design and data ef
 - Learnable Component: ViT + MLP + LLM
 - Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples.
 
-
-## Released Models
-
-| Model | Vision Foundation Model | Release Date | Note |
-| :---: | :---: | :---: | :--- |
-| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | Supports 4K images; super-strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
-| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | More SFT data and stronger performance |
-| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | Scales the LLM up to 34B |
-| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | Supports Chinese; stronger OCR |
-
-
-
 ## Performance
 
 \* Proprietary Model
@@ -75,7 +60,6 @@ For better training reproducibility, we follow the minimalist design and data ef
 - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
 - Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.
 
-
 ## Training Details
 
 ### Data Preparation
@@ -84,7 +68,6 @@ Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train Intern
 
 For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
 
-
 ### Training (Supervised Finetuning)
 
 We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
@@ -97,9 +80,6 @@ The hyperparameters used for finetuning are listed in the following table.
 | ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
 | InternVL-Chat-V1-2 | 40B (full model) | 512               | 1e-5          | 1      | 2048       | 0.05         |
 
-
-
-
 ## Model Usage
 
 We provide an example code to run InternVL-Chat-V1-2 using `transformers`.
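
The example itself is elided from this diff. For orientation, here is a minimal sketch of what loading and querying the model with `transformers` typically looks like for InternVL checkpoints. The image file, the 448×448 input resolution, and the `model.chat` signature (which comes from the checkpoint's custom remote code) are assumptions, not a copy of the official example:

```python
# Minimal sketch, not the official example from the README.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2"

# The checkpoint ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Preprocess one image at the model's assumed 448x448 input resolution.
image = Image.open("example.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# `model.chat` is provided by the remote code; its exact signature here is
# an assumption based on other InternVL releases.
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

For the authoritative version, refer to the Model Usage section of the rendered README rather than this sketch.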
@@ -178,12 +158,3 @@ If you find this project useful in your research, please consider citing:
 ## License
 
 This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
-
-Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
-
-## Acknowledgement
-
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
-
-## Contributors
-Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai
 