Update README.md

README.md CHANGED
@@ -1,218 +1,12 @@

<div align='center'>
<img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>

We introduce **Emu3**, a new suite of state-of-the-art multimodal models trained solely with **<i>next-token prediction</i>**! By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
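
To make the discrete tokenization concrete, the sketch below round-trips a single image through the Emu3-VisionTokenizer used in the quickstart further down. The `encode`/`decode` method names, the `pixel_values` key, and the tensor shapes are assumptions for illustration, not the tokenizer's documented API.

```python
# Minimal sketch: image -> discrete vision tokens -> reconstructed image.
# Method names (encode/decode) and shapes are assumptions, not the official interface.
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

VQ_HUB = "BAAI/Emu3-VisionTokenizer"

image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()

image = Image.open("assets/demo.png")
pixels = image_processor(image, return_tensors="pt")["pixel_values"].to("cuda:0")

with torch.no_grad():
    codes = image_tokenizer.encode(pixels)   # assumed: a grid of discrete token ids per image
    recon = image_tokenizer.decode(codes)    # assumed: decodes token ids back to pixels

print(codes.shape)  # the token grid that the transformer models with next-token prediction
```

The language model then treats these discrete ids like ordinary text tokens inside a single multimodal sequence.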

### Emu3 excels in both generation and perception

**Emu3** outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6 and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.

<div align='center'>
<img src="./assets/comparison.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
</div>

### Highlights

- **Emu3** generates high-quality images from text input simply by predicting the next vision token. The model naturally supports flexible resolutions and styles.
- **Emu3** shows strong vision-language understanding: it perceives the physical world and provides coherent text responses. Notably, this capability is achieved without depending on a CLIP vision encoder or a pretrained LLM.
- **Emu3** generates videos causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. Given a video as context, Emu3 can also naturally extend it and predict what will happen next.

### TODO

- [X] Release model weights of the tokenizer, Emu3-Chat and Emu3-Gen.
- [X] Release the inference code.
- [ ] Release the evaluation code.
- [ ] Release training scripts for pretraining, SFT and DPO.

### Setup

Clone this repository and install required packages:

```shell
git clone https://github.com/baaivision/Emu3
cd Emu3

pip install -r requirements.txt
```
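
Note that the quickstart snippets below load Emu3 with `attn_implementation="flash_attention_2"`, which requires a CUDA GPU and the `flash-attn` package (typically installed with `pip install flash-attn --no-build-isolation`). If FlashAttention is not available, dropping that argument falls back to the default attention implementation.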

### Model Weights

| Model name               | HF Weight                                                      |
| ------------------------ | -------------------------------------------------------------- |
| **Emu3-Chat**            | [🤗 HF link](https://huggingface.co/BAAI/Emu3-Chat)            |
| **Emu3-Gen**             | [🤗 HF link](https://huggingface.co/BAAI/Emu3-Gen)             |
| **Emu3-VisionTokenizer** | [🤗 HF link](https://huggingface.co/BAAI/Emu3-VisionTokenizer) |
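
The `from_pretrained` calls in the quickstart below download these checkpoints on demand; to pre-fetch them into the local cache (for example, before moving to a machine with limited network access), the standard `huggingface_hub` helper can be used:

```python
# Optional: pre-download the checkpoints listed above into the local Hugging Face cache.
from huggingface_hub import snapshot_download

for repo in ("BAAI/Emu3-Gen", "BAAI/Emu3-Chat", "BAAI/Emu3-VisionTokenizer"):
    snapshot_download(repo_id=repo)
```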

### Quickstart

#### Use 🤗Transformers to run Emu3-Gen for image generation

```python
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
import torch

from emu3.mllm.processing_emu3 import Emu3Processor


# model path
EMU_HUB = "BAAI/Emu3-Gen"
VQ_HUB = "BAAI/Emu3-VisionTokenizer"

# prepare model and processor
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# prepare input
POSITIVE_PROMPT = " masterpiece, film grained, best quality."
NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."

classifier_free_guidance = 3.0
prompt = "a portrait of young girl."
prompt += POSITIVE_PROMPT

kwargs = dict(
    mode='G',
    ratio="1:1",
    image_area=model.config.image_area,
    return_tensors="pt",
)
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text=NEGATIVE_PROMPT, **kwargs)

# prepare hyperparameters
GENERATION_CONFIG = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)

# classifier-free guidance steers sampling toward the prompt and away from the negative prompt;
# the prefix constraint keeps generated tokens on a valid image token grid
h, w = pos_inputs.image_size[0]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        classifier_free_guidance,
        model,
        unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
    ),
    PrefixConstrainedLogitsProcessor(
        constrained_fn,
        num_beams=1,
    ),
])

# generate
outputs = model.generate(
    pos_inputs.input_ids.to("cuda:0"),
    GENERATION_CONFIG,
    logits_processor=logits_processor
)

# decode the generated vision tokens back into images and save them
mm_list = processor.decode(outputs[0])
for idx, im in enumerate(mm_list):
    if not isinstance(im, Image.Image):
        continue
    im.save(f"result_{idx}.png")
```
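
The `ratio` and `image_area` arguments control the shape and size of the generated image. Only `"1:1"` appears in the example above, so other ratio strings such as `"16:9"` in the sketch below are an assumption about accepted values rather than a documented list:

```python
# Hypothetical variation: ask for a wide image instead of a square one.
# "16:9" is an assumed ratio string; only "1:1" appears in the original example.
wide_inputs = processor(
    text=prompt,
    mode='G',
    ratio="16:9",
    image_area=model.config.image_area,
    return_tensors="pt",
)

# The prefix-constraint setup would then be rebuilt from the new token grid
# before calling model.generate with wide_inputs.input_ids.
h, w = wide_inputs.image_size[0]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
```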

#### Use 🤗Transformers to run Emu3-Chat for vision-language understanding

```python
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
import torch

from emu3.mllm.processing_emu3 import Emu3Processor


# model path
EMU_HUB = "BAAI/Emu3-Chat"
VQ_HUB = "BAAI/Emu3-VisionTokenizer"

# prepare model and processor
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# prepare input
text = "Please describe the image"
image = Image.open("assets/demo.png")

inputs = processor(
    text=text,
    image=image,
    mode='U',
    padding_side="left",
    padding="longest",
    return_tensors="pt",
)

# prepare hyperparameters
GENERATION_CONFIG = GenerationConfig(
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# generate
outputs = model.generate(
    inputs.input_ids.to("cuda:0"),
    GENERATION_CONFIG,
    max_new_tokens=320,
)

# keep only the newly generated tokens and decode them to text
outputs = outputs[:, inputs.input_ids.shape[-1]:]
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```

## Acknowledgement

We thank the authors of [Emu Series](https://github.com/baaivision/Emu), [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) and [MoVQGAN](https://github.com/ai-forever/MoVQGAN) for their great work.

<!--
## Citation

If you find Emu3 useful for your research and applications, please consider starring this repository and citing:

```
@article{Emu2,
  title={Generative Multimodal Models are In-Context Learners},
  author={Quan Sun and Yufeng Cui and Xiaosong Zhang and Fan Zhang and Qiying Yu and Zhengxiong Luo and Yueze Wang and Yongming Rao and Jingjing Liu and Tiejun Huang and Xinlong Wang},
  journal={arXiv preprint arXiv:2312.13286},
  year={2023},
}
```
-->

---
title: Emu3
emoji: 🌖
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 5.0.0b1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference