---
language:
- ko
- en
license: cc-by-nc-sa-4.0
library_name: transformers
---

# Llama-3-KoEn-8B-xtuner-llava-preview 🌋


Llama-3-KoEn-8B-xtuner-llava-preview 🌋 is a Korean/English multimodal model based on the LLaVA architecture. It was built by merging two models with the [Chat Vector](https://arxiv.org/abs/2310.04799) method:
1) [beomi/Llama-3-KoEn-8B-preview](https://huggingface.co./beomi/Llama-3-KoEn-8B-preview)
2) [xtuner/llava-llama-3-8b-transformers](https://huggingface.co./xtuner/llava-llama-3-8b-transformers)
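
At a high level, a Chat Vector merge is a weight-space operation: subtract the shared base model's weights from the donor's (here, the LLaVA-tuned LLM) to isolate the tuning delta, then add that delta onto the target model (here, the Korean LLM). The sketch below illustrates the arithmetic on toy state dicts; the function and parameter names are illustrative, and the actual merge script used for this checkpoint is not published here.

```python
import torch

def chat_vector_merge(target_sd, donor_sd, base_sd):
    """Chat Vector-style merge: target + (donor - base).

    target_sd: target weights (e.g. the Korean LLM)
    donor_sd:  donor weights carrying the desired skill (e.g. the LLaVA LLM)
    base_sd:   the base model both were tuned from
    """
    merged = {}
    for name, w in target_sd.items():
        if name in donor_sd and name in base_sd:
            # Add the delta the donor learned on top of the shared base.
            merged[name] = w + (donor_sd[name] - base_sd[name])
        else:
            # Parameters unique to the target are kept as-is.
            merged[name] = w.clone()
    return merged

# Toy demonstration with constant tensors standing in for real checkpoints.
base = {"layer.weight": torch.zeros(2, 2)}
donor = {"layer.weight": torch.ones(2, 2)}          # base + delta of 1.0
target = {"layer.weight": torch.full((2, 2), 3.0)}  # separately tuned target

merged = chat_vector_merge(target, donor, base)
print(merged["layer.weight"])  # 3 + (1 - 0) = 4 in every entry
```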

## Model Details

### Model Description

- **Developed by:** Junbum Lee (Beomi)
- **Model type:** HuggingFace Llava 🌋
- **Language(s) (NLP):** Korean, English
- **License:** cc-by-nc-sa-4.0, subject to the Llama 3 license
- **Merged from model:** [beomi/Llama-3-KoEn-8B-preview](https://huggingface.co./beomi/Llama-3-KoEn-8B-preview) / [xtuner/llava-llama-3-8b-transformers](https://huggingface.co./xtuner/llava-llama-3-8b-transformers)

### Direct Use


![Cat walking on frozen Han-River, Seoul](https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/NWfoArWI4UPAxpEnolkwT.jpeg)

```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_id = "beomi/Llama-3-KoEn-8B-xtuner-llava-preview"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='auto',
)

processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Llama 3 can stop on either its regular EOS token or <|eot_id|>.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

# Korean prompt: "Please describe this image." The assistant turn is pre-filled
# with "이 이미지에는" ("In this image, ...") to steer the response.
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\n이 이미지에 대해서 설명해주세요.<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n이 이미지에는")
image_file = "https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/NWfoArWI4UPAxpEnolkwT.jpeg"

raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Move inputs to the model's device and dtype (only floating-point tensors are cast).
inputs = processor(prompt, raw_image, return_tensors='pt').to(model.device, model.dtype)

output = model.generate(**inputs, max_new_tokens=400, do_sample=True, eos_token_id=terminators)
print(processor.decode(output[0][2:], skip_special_tokens=False))

```

Example output (sampled with `do_sample=True`, so results will vary):

```
user<|end_header_id|>

<image>
이 이미지에 대해서 설명해주세요.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

이 이미지에는 고양이 한 마리가 강물 위를 걸어가는 모습이 보여집니다. 고양이는 강물의 잔물결에 미끄럼을 타고 강 가로를 지나는 데 능숙하게 보입니다. 고양이의 발은 강물로 잘 들어가, 그것을 즐기며 걸어갑니다.

또한 이 이미지도 음성 녹음을 하거나 녹화된 자료로 제작되었으며, 주로 고양이의 모습을 강하게 보여줍니다. 소리 효과도 여러 가지로 추가하여 고양이의 스토리를 다양하게 전달합니다. 강물은 잔물결을 나타내며 강물 위를 걷는 고양이의 모습을 더욱 강렬하게 강조하기 위해 잔물결을 통해 더 디테일한 장면을 보여줍니다.<|eot_id|>
```

Translation of the assistant's answer: "This image shows a cat walking on a river. The cat looks adept at gliding over the river's ripples and making its way across. Its paws dip into the water, and it walks along enjoying it. In addition, this image was produced as an audio recording or recorded footage, mainly highlighting the cat. Various sound effects are added to convey the cat's story in different ways. The river shows ripples, and to emphasize the cat walking on the water even more strikingly, the ripples reveal a more detailed scene."