---
license: other
license_name: deepseek
license_link: LICENSE
---

## 1. Introduction

Introducing DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied intelligence in complex scenarios.

[DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/abs/2403.05525)

Haoyu Lu*, Wen Liu*, Bo Zhang**, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (*Equal Contribution, **Project Leader)

![](https://github.com/deepseek-ai/DeepSeek-VL/blob/main/images/sample.jpg)

## 2. Model Summary

DeepSeek-VL-7b-base uses [SigLIP-L](https://huggingface.co/timm/ViT-L-16-SigLIP-384) and [SAM-B](https://huggingface.co/facebook/sam-vit-base) as a hybrid vision encoder supporting 1024 x 1024 image input, and is built on DeepSeek-LLM-7b-base, which was trained on a corpus of approximately 2T text tokens. The whole DeepSeek-VL-7b-base model was then trained on around 400B vision-language tokens.
DeepSeek-VL-7b-chat is an instruction-tuned version of [DeepSeek-VL-7b-base](https://huggingface.co/deepseek-ai/deepseek-vl-7b-base).
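
If you only need the pretrained base checkpoint (for example, for your own fine-tuning), it can presumably be loaded the same way as the chat model in the Quick Start below; this is a minimal sketch assuming the base repo exposes the same loading path:

```python
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM

# assumption: deepseek-vl-7b-base loads via the same processor/model classes
# as the chat checkpoint shown in the Quick Start section
base_path = "deepseek-ai/deepseek-vl-7b-base"
processor: VLChatProcessor = VLChatProcessor.from_pretrained(base_path)
model: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    base_path, trust_remote_code=True
)
model = model.to(torch.bfloat16).cuda().eval()
```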

## 3. Quick Start

### Installation

In a `Python >= 3.8` environment, install the necessary dependencies by running the following commands:

```shell
git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL

pip install -r requirements.txt -e .
```
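
To quickly confirm the editable install is importable, a one-line check (it touches only a module the example below already imports) can be run from the repository root:

```shell
python -c "from deepseek_vl.models import VLChatProcessor; print('deepseek_vl import OK')"
```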

### Simple Inference Example

```python
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
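
For more varied generations, the `generate` call above also accepts the standard sampling parameters from `transformers`; the values below are illustrative assumptions, not settings recommended by the model authors:

```python
# sampling variant of the greedy call above; temperature/top_p are
# illustrative, untuned values
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    use_cache=True
)
```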

### CLI Chat

```bash
python cli_chat.py --model_path "deepseek-ai/deepseek-vl-7b-chat"

# or a local path
python cli_chat.py --model_path "local model path"
```

## 4. License

This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). The use of DeepSeek-VL Base/Chat models is subject to [the Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL). The DeepSeek-VL series (including Base and Chat) supports commercial use.

## 5. Citation

```bibtex
@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding},
      author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```

## 6. Contact

If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]).