---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- multimodal
- conversational
- ncsoft
- varco
base_model:
- Qwen/Qwen2.5-14B-Instruct
- google/siglip-so400m-patch14-384
library_name: transformers
pipeline_tag: image-text-to-text
---

# VARCO-VISION-14B-HF

## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates a text output. It supports grounding, referring, and OCR (Optical Character Recognition).

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
- **Blog (Korean):** [VARCO-VISION Technical Report Summary](https://ncsoft.github.io/ncresearch/95ad8712e60063e9ac97538504ac3eea0ac530af)
- **Demo Page:** *The demo page is no longer available.*
- **Languages:** Korean, English
- **License:** CC BY-NC 4.0
- **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
- **Base Model:**
  - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co./Qwen/Qwen2.5-14B-Instruct)
  - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co./google/siglip-so400m-patch14-384)
- **LLaVA-NeXT Codebase Model:** [NCSOFT/VARCO-VISION-14B](https://huggingface.co./NCSOFT/VARCO-VISION-14B)
- **Korean VLM Benchmarks:**
  - [NCSOFT/K-MMBench](https://huggingface.co./datasets/NCSOFT/K-MMBench)
  - [NCSOFT/K-SEED](https://huggingface.co./datasets/NCSOFT/K-SEED)
  - [NCSOFT/K-MMStar](https://huggingface.co./datasets/NCSOFT/K-MMStar)
  - [NCSOFT/K-DTCBench](https://huggingface.co./datasets/NCSOFT/K-DTCBench)
  - [NCSOFT/K-LLaVA-W](https://huggingface.co./datasets/NCSOFT/K-LLaVA-W)
- **This model is for research purposes only. Commercial use is prohibited.**


## Uses

### Direct Use
To use this model, ensure you have `transformers >= 4.45.0` installed.

```python
import torch
import requests
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name)
device = model.device

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

EOS_TOKEN = "<|im_end|>"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(device, torch.float16)

output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
output = processor.decode(output[0][inputs.input_ids.shape[1]:])
if output.endswith(EOS_TOKEN):
    output = output[: -len(EOS_TOKEN)]

output = output.strip()
print(output)
```

### Specialized Features

If a question refers to bounding boxes or requires bounding boxes in its output, include the relevant special tokens in the input text.

The following special tokens are used to define specific tasks, inputs, and outputs for the model:

- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<char>` and `</char>`: Used to mark a text phrase.
- `<obj>` and `</obj>`: Used to indicate an object.
- `<bbox>` and `</bbox>`: Used to represent a bounding box.
- `<delim>`: Represents multiple location points for a single object or text.

#### Grounding

Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```html
The image shows <obj>two cats</obj><bbox>0.014, 0.106, 0.51, 0.996<delim>0.51, 0.054, 0.996, 0.787</bbox> lying on <obj>a pink blanket</obj><bbox>0.003, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket, while the cat on the right is lying on its stomach with its head also resting on the blanket. Both cats appear to be relaxed and comfortable. There are <obj>two remote controls</obj><bbox>0.037, 0.141, 0.283, 0.253<delim>0.506, 0.171, 0.581, 0.295</bbox> placed near the cats, one on the left side and one on the right side of the image.
```

<img src="assets/grounding.png" alt="Grounding Example" width="400"/>

#### Referring

VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform a referring task, wrap the object of interest in `<obj>` and `</obj>` tags and specify its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is represented in the form (x1, y1, x2, y2): the first two values give the top-left corner of the box and the latter two give the bottom-right corner, with all coordinates expressed relative to the image width and height.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<obj>์ด ๋ฌผ๊ฑด</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>์€ ์–ด๋–ป๊ฒŒ ์“ฐ๋Š”๊ฑฐ์•ผ?",
            },
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```
**์ด ๋ฌผ๊ฑด**์€ ๋ฆฌ๋ชจ์ปจ์œผ๋กœ, ์ฃผ๋กœ ํ…”๋ ˆ๋น„์ „์ด๋‚˜ ๋‹ค๋ฅธ ์ „์ž ๊ธฐ๊ธฐ๋ฅผ ์›๊ฒฉ์œผ๋กœ ์กฐ์ž‘ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋ฆฌ๋ชจ์ปจ์—๋Š” ๋‹ค์–‘ํ•œ ๋ฒ„ํŠผ์ด ์žˆ์œผ๋ฉฐ, ๊ฐ  ๋ฒ„ํŠผ์€ ์ฑ„๋„ ๋ณ€๊ฒฝ, ๋ณผ๋ฅจ ์กฐ์ ˆ, ์ „์› ์ผœ๊ธฐ/๋„๊ธฐ ๋“ฑ์˜ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž๋Š” ๋ฆฌ๋ชจ์ปจ์„ ์†์— ๋“ค๊ณ  ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด, ํ•ด๋‹น ๊ธฐ๊ธฐ์— ์‹ ํ˜ธ๋ฅผ ๋ณด๋‚ด ์›ํ•˜๋Š” ์กฐ์ž‘์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฆฌ๋ชจ์ปจ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ€์ •์ด๋‚˜ ์‚ฌ๋ฌด์‹ค์—์„œ ํŽธ๋ฆฌํ•˜๊ฒŒ ์ „์ž ๊ธฐ๊ธฐ๋ฅผ ์กฐ์ž‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
```
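
When the region of interest is known in pixel coordinates, it can be converted into the tag format above by normalizing against the image size. A small helper is sketched here (the function name `make_referring_prompt` and the example values are illustrative):

```python
def make_referring_prompt(phrase, bbox_px, width, height, question):
    """Wrap a phrase and its pixel-space (x1, y1, x2, y2) box in the
    referring tags, normalizing coordinates to relative values."""
    x1, y1, x2, y2 = bbox_px
    coords = ", ".join(
        f"{v:.3f}" for v in (x1 / width, y1 / height, x2 / width, y2 / height)
    )
    return f"<obj>{phrase}</obj><bbox>{coords}</bbox>{question}"

text = make_referring_prompt(
    "this object", (25, 66, 181, 123), 640, 480, " What is it used for?"
)
print(text)
```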

#### OCR

To perform Optical Character Recognition (OCR), use the `<ocr>` token.

```python
image_file = "./assets/ocr_1.png"
raw_image = Image.open(image_file)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**

```
<char>๋ฐฑ๋ฒ”๋กœ</char><bbox>0.172, 0.266, 0.328, 0.341</bbox>
<char>124๋ฒˆ๊ธธ</char><bbox>0.347, 0.266, 0.512, 0.341</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.337, 0.433, 0.392</bbox>
<char>124</char><bbox>0.444, 0.341, 0.508, 0.392</bbox>
<char>๋งŒ์ˆ˜์ฃผ๊ณต์•„ํŒŒํŠธ</char><bbox>0.109, 0.531, 0.335, 0.601</bbox>
<char>์‹œํฅ</char><bbox>0.443, 0.518, 0.522, 0.581</bbox>
<char>์‹œ์ฒญ</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.102, 0.601, 0.181, 0.648</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.28, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.648</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.625</bbox>
<char>์ธ์ฒœ๋Œ€๊ณต์›</char><bbox>0.43, 0.621, 0.609, 0.684</bbox>
<char>๋ชจ๋ž˜๋‚ด์‹œ์žฅ์—ญ</char><bbox>0.651, 0.59, 0.873, 0.665</bbox>
<char>IncheonGrand</char><bbox>0.432, 0.681, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.681, 0.611, 0.723</bbox>
```

<img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>

## Citing the Model

If you use VARCO-VISION-14B in your research, please cite the following: 

```bibtex
@misc{ju2024varcovisionexpandingfrontierskorean,
    title={VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models}, 
    author={Jeongho Ju and Daeyoung Kim and SunYoung Park and Youngjune Kim},
    year={2024},
    eprint={2411.19103},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.19103},
}
```