---
language:
- en
---

# Generative Multimodal Models are In-Context Learners


[Quan Sun](https://github.com/Quan-Sun)1*, [Yufeng Cui](https://scholar.google.com/citations?hl=en&user=5Ydha2EAAAAJ)1*, [Xiaosong Zhang](https://zhangxiaosong18.github.io)1*, [Fan Zhang](https://scholar.google.com/citations?user=VsJ39HMAAAAJ)1*, [Qiying Yu](https://yqy2001.github.io)2,1*, [Zhengxiong Luo](https://greatlog.github.io)1, [Yueze Wang]()1, [Yongming Rao](https://raoyongming.github.io)1,
[Jingjing Liu](https://air.tsinghua.edu.cn/en/info/1046/1194.htm)2, [Tiejun Huang](https://scholar.google.com/citations?user=knvEK4AAAAAJ&hl=en)1,3, [Xinlong Wang](https://www.xloong.wang/)1†

1 [BAAI](https://www.baai.ac.cn/english.html), 2 [THU](https://air.tsinghua.edu.cn), 3 [PKU](https://english.pku.edu.cn/)
\* equal contribution † project lead

[Paper](https://arxiv.org/abs/2312.13286) | [🤗HF Demo](https://huggingface.co./spaces/BAAI/Emu2) | [Demo](https://emu.ssi.plus) | [Project Page](https://baaivision.github.io/emu2/) | [Github](https://github.com/baaivision/Emu)
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce **Emu2**, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. **Emu2** exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, **Emu2** further achieves new state-of-the-art results on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that **Emu2** can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

## Model Weights

| Model name    | Weight                                               |
| ------------- | ---------------------------------------------------- |
| **Emu2**      | [🤗 HF link](https://huggingface.co./BAAI/Emu2)      |
| **Emu2-Chat** | [🤗 HF link](https://huggingface.co./BAAI/Emu2-Chat) |
| **Emu2-Gen**  | [🤗 HF link](https://huggingface.co./BAAI/Emu2-Gen)  |
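The multi-GPU example below loads the checkpoint from a local path. If you need a local copy of the weights first, one way to fetch them is with `huggingface_hub`; this is a minimal sketch, and the destination directory is only an illustration:

```python
# Sketch: download the Emu2 weights to a local folder with huggingface_hub.
# The local_dir below is an example; any writable directory works.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="BAAI/Emu2",   # likewise for BAAI/Emu2-Chat or BAAI/Emu2-Gen
    local_dir="./Emu2",    # example destination
)
print(local_path)  # this path can be passed to load_checkpoint_and_dispatch below
```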
## Inference (Huggingface Version)

#### Single GPU

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# the number of `[]` should be equal to the number of input images
query = '[]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true',stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Interleaved image and text

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# the number of `[]` should be equal to the number of input images
query = "[][red, white, 3, bottom left].[][yellow, white, 2, top left].[][green, black, 4, bottom right][]"
images = [
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/red_white_3_bottom_left.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/yellow_white_2_top_right.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/green_black_4_bottom_right.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true',stream=True).raw).convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
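In the interleaved query above, the first three `[]`/caption pairs act as in-context demonstrations and the trailing `[]` leaves the fourth image for the model to describe in the same bracketed format. `build_input_ids` replaces each `[]` with an image embedding, so the number (and order) of placeholders should match the `images` list. As in the single-image example, `tokenizer.batch_decode` returns one string per query, so the completion can be read from `output_text[0]`.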
#### Multi GPU

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(
    model,
    max_memory={0: '38GiB', 1: '38GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2/model',
    device_map=device_map).eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# the number of `[]` should be equal to the number of input images
query = '[]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true',stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Interleaved image and text

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(
    model,
    max_memory={0: '38GiB', 1: '38GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2/model',
    device_map=device_map).eval()

# `[]` is the image placeholder which will be replaced by image embeddings.
# the number of `[]` should be equal to the number of input images
query = "[][red, white, 3, bottom left].[][yellow, white, 2, top left].[][green, black, 4, bottom right][]"
images = [
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/red_white_3_bottom_left.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/yellow_white_2_top_right.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/green_black_4_bottom_right.jpg?raw=true',stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true',stream=True).raw).convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

#### Quantization

Check quantization guidance at [transformers](https://huggingface.co./docs/transformers/v4.28.0/main_classes/quantization)

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2",
    load_in_4bit=True,
    trust_remote_code=True,
    bnb_4bit_compute_dtype=torch.float16).eval()

query = '[]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true',stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.float16),  # should be torch.float16
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
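Newer `transformers` releases generally prefer passing 4-bit options through an explicit `BitsAndBytesConfig` rather than bare keyword arguments. A minimal sketch of the same load under that assumption (requires `bitsandbytes` to be installed):

```python
# Sketch: the 4-bit load above expressed with BitsAndBytesConfig
# (assumes a recent transformers release with bitsandbytes available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2",
    quantization_config=quant_config,
    trust_remote_code=True).eval()

# Inference then proceeds as in the quantization example above,
# keeping the image embeddings in torch.float16.
```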