kimyoungjune committed
Commit 1659466 · verified · 1 Parent(s): 277590a

Update README.md

Files changed (1): README.md (+196, -191)

README.md (updated):
---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- multimodal
- conversational
- ncsoft
- varco
base_model:
- Qwen/Qwen2.5-14B-Instruct
- google/siglip-so400m-patch14-384
library_name: transformers
---

# VARCO-VISION-14B-HF

## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference optimization stage. Designed to excel in both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input, generating text as output. It supports grounding (the ability to identify the locations of objects within an image) as well as OCR (Optical Character Recognition) to recognize text within images.

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [Coming Soon]()
- **Demo Page:** [Coming Soon]()
- **Languages:** Korean, English
- **License:** CC BY-NC 4.0
- **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
- **Base Model:**
  - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
  - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **LLaVA-NeXT Codebase Model:** [NCSOFT/VARCO-VISION-14B](https://huggingface.co/NCSOFT/VARCO-VISION-14B)
- **Korean VLM Test Sets:**
  - [NCSOFT/K-MMBench](https://huggingface.co/datasets/NCSOFT/K-MMBench)
  - [NCSOFT/K-SEED](https://huggingface.co/datasets/NCSOFT/K-SEED)
  - [NCSOFT/K-MMStar](https://huggingface.co/datasets/NCSOFT/K-MMStar)
  - [NCSOFT/K-DTCBench](https://huggingface.co/datasets/NCSOFT/K-DTCBench)
  - [NCSOFT/K-LLaVA-W](https://huggingface.co/datasets/NCSOFT/K-LLaVA-W)


## Uses

### Direct Use
To use this model, ensure you have `transformers >= 4.45.0` installed.
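If you want to confirm the requirement before running the example below, a minimal version check is shown here; the `packaging` helper is an assumption of this sketch, not something the model needs (install it with `pip install packaging` if missing):

```python
# Optional: confirm that the installed transformers meets the >= 4.45.0 requirement.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.45.0"), (
    f"Found transformers {transformers.__version__}; please upgrade to >= 4.45.0."
)
```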

```python
import requests
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="float16",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_name)
device = model.device

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

EOS_TOKEN = "<|im_end|>"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(device)

output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
output = processor.decode(output[0][inputs.input_ids.shape[1]:])
if output.endswith(EOS_TOKEN):
    output = output[: -len(EOS_TOKEN)]

output = output.strip()
print(output)
```
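The example above requests `flash_attention_2`, which needs the separate `flash-attn` package and a supported GPU. If FlashAttention 2 is unavailable in your environment, the model can be loaded the same way with PyTorch's built-in attention; a minimal sketch (only the `attn_implementation` argument changes, the rest of the example stays the same):

```python
# Fallback loading without FlashAttention 2: use PyTorch's scaled-dot-product attention.
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="float16",
    device_map="auto",
    attn_implementation="sdpa",  # built-in attention; no flash-attn dependency
)
processor = AutoProcessor.from_pretrained(model_name)
```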

### Specialized Features

To ask questions or receive answers that involve bounding boxes (e.g., grounding, referring, and OCR tasks), include special tokens in the input text.

The following special tokens define specific tasks, inputs, and outputs for the model:

- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<char>` and `</char>`: Used to mark a text phrase.
- `<obj>` and `</obj>`: Used to indicate an object.
- `<bbox>` and `</bbox>`: Used to represent a bounding box.
- `<delim>`: Represents multiple location points for a single object or text.

#### Grounding

Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```html
The image shows <obj>two cats</obj><bbox>0.014, 0.106, 0.51, 0.996<delim>0.51, 0.054, 0.996, 0.787</bbox> lying on <obj>a pink blanket</obj><bbox>0.003, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket, while the cat on the right is lying on its stomach with its head also resting on the blanket. Both cats appear to be relaxed and comfortable. There are <obj>two remote controls</obj><bbox>0.037, 0.141, 0.283, 0.253<delim>0.506, 0.171, 0.581, 0.295</bbox> placed near the cats, one on the left side and one on the right side of the image.
```

<img src="assets/grounding.png" alt="Grounding Example" width="400"/>
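The bounding-box values in the grounded output lie in [0, 1], which suggests they are normalized to the image width and height; the tag names come directly from the special-token list above. Under that assumption, here is a minimal parsing sketch that turns a grounded response into pixel-space boxes (the helper name and regex are illustrative, not part of the model's API):

```python
import re

def parse_grounded_output(text: str, image_width: int, image_height: int):
    """Extract (label, [pixel boxes]) pairs from an <obj>/<bbox> grounded response."""
    pattern = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>", re.DOTALL)
    results = []
    for label, bbox_span in pattern.findall(text):
        boxes = []
        for box in bbox_span.split("<delim>"):  # one object can carry several boxes
            x1, y1, x2, y2 = (float(v) for v in box.split(","))
            boxes.append((
                round(x1 * image_width), round(y1 * image_height),
                round(x2 * image_width), round(y2 * image_height),
            ))
        results.append((label.strip(), boxes))
    return results

# With the output shown above and a 640x480 image:
# parse_grounded_output(output, 640, 480)
# -> [('two cats', [(9, 51, 326, 478), (326, 26, 637, 378)]), ...]
```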

#### Referring

VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation by including the object of interest within `<obj>` and `</obj>` tags and specifying its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
            },
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```
**이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 리모컨에는 다양한 버튼이 있으며, 각 버튼은 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행합니다. 사용자는 리모컨을 손에 들고 버튼을 누르면, 해당 기기에 신호를 보내 원하는 조작을 할 수 있습니다. 리모컨은 일반적으로 가정이나 사무실에서 편리하게 전자 기기를 조작할 수 있도록 사용됩니다.
```
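In the example above, the Korean question asks roughly "How do you use this object?", and the expected answer explains that the boxed object is a remote control used to operate a television or other electronics. If you build referring prompts programmatically, a small helper can keep the tag formatting consistent; a minimal sketch (the function name and the normalized-coordinate assumption are illustrative, not part of the model's API):

```python
def referring_prompt(label: str, box: tuple[float, float, float, float], question: str) -> str:
    """Build a referring question that points at `label` inside the normalized box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return f"<obj>{label}</obj><bbox>{x1}, {y1}, {x2}, {y2}</bbox>{question}"

# Reproduces the prompt string used in the conversation above:
text = referring_prompt("이 물건", (0.039, 0.138, 0.283, 0.257), "은 어떻게 쓰는거야?")
```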

#### OCR

To perform Optical Character Recognition (OCR), use the `<ocr>` token.

```python
image_file = "./assets/ocr_1.png"
raw_image = Image.open(image_file)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**

```
<char>백범로</char><bbox>0.172, 0.266, 0.328, 0.341</bbox>
<char>124번길</char><bbox>0.347, 0.266, 0.512, 0.341</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.337, 0.433, 0.392</bbox>
<char>124</char><bbox>0.444, 0.341, 0.508, 0.392</bbox>
<char>만수주공아파트</char><bbox>0.109, 0.531, 0.335, 0.601</bbox>
<char>시흥</char><bbox>0.443, 0.518, 0.522, 0.581</bbox>
<char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.102, 0.601, 0.181, 0.648</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.28, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.648</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.625</bbox>
<char>인천대공원</char><bbox>0.43, 0.621, 0.609, 0.684</bbox>
<char>모래내시장역</char><bbox>0.651, 0.59, 0.873, 0.665</bbox>
<char>IncheonGrand</char><bbox>0.432, 0.681, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.681, 0.611, 0.723</bbox>
```

<img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
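Each line of the OCR output pairs a recognized string with a box, so it can be post-processed the same way as the grounding output. Below is a minimal sketch that overlays the detected regions on the input image, assuming (as above) that coordinates are normalized to the image size; `draw_ocr_boxes` is an illustrative helper, not part of the model card:

```python
import re
from PIL import Image, ImageDraw

def draw_ocr_boxes(ocr_text: str, image: Image.Image) -> Image.Image:
    """Draw the <char>/<bbox> detections from an OCR response onto a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    width, height = annotated.size
    for chars, box in re.findall(r"<char>(.*?)</char><bbox>(.*?)</bbox>", ocr_text):
        x1, y1, x2, y2 = (float(v) for v in box.split(","))
        draw.rectangle((x1 * width, y1 * height, x2 * width, y2 * height), outline="red", width=2)
        print(chars)  # drawing Hangul labels on the image itself would need a TrueType font with Korean glyphs
    return annotated

# annotated = draw_ocr_boxes(output, raw_image)
# annotated.save("ocr_annotated.png")
```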

## Citing the Model

If you use VARCO-VISION-14B in your research, please cite the following (*the BibTeX entry will be updated soon*):

```bibtex

```