|
--- |
|
license: apache-2.0 |
|
library_name: transformers |
|
--- |
|
|
|
<div align='center'> |
|
<h1>Emu3: Next-Token Prediction is All You Need</h1>
|
|
|
|
[Emu3 Team, BAAI](https://www.baai.ac.cn/english.html) |
|
|
|
| [Project Page](https://emu.baai.ac.cn) | [Paper](https://huggingface.co./papers/2409.18869) | [🤗HF Models](https://huggingface.co./collections/BAAI/emu3-66f4e64f70850ff358a2e60f) | [GitHub](https://github.com/baaivision/Emu3) | [Demo](https://huggingface.co./spaces/BAAI/Emu3) |
|
|
|
|
|
</div> |
|
|
|
<div align='center'> |
|
<img src="https://github.com/baaivision/Emu3/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="80%" width="70%" /> |
|
</div> |
|
|
|
We introduce **Emu3**, a new suite of state-of-the-art multimodal models trained solely with ***next-token prediction***! By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
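The key point is that every modality is reduced to ids in one shared vocabulary, so training is plain next-token prediction over flat sequences. Below is a minimal conceptual sketch of that idea; it is not Emu3's actual code, and the vocabulary size and token ids are illustrative assumptions.

```python
# Conceptual sketch only: shows the shared-vocabulary idea, not Emu3's code.
import torch

TEXT_VOCAB_SIZE = 32_000      # assumed text vocabulary size (illustrative)

text_tokens = torch.tensor([17, 934, 2051])      # made-up ids for a text prompt
vision_codes = torch.tensor([5, 129, 77])        # discrete codes from a vision tokenizer
vision_tokens = vision_codes + TEXT_VOCAB_SIZE   # offset vision codes into the shared vocab

# One flat multimodal sequence; training is standard next-token cross-entropy,
# with each position predicting the token that follows it.
sequence = torch.cat([text_tokens, vision_tokens])
inputs, targets = sequence[:-1], sequence[1:]
```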
|
|
|
### Emu3 excels in both generation and perception |
|
**Emu3** outperforms several well-established task-specific models on both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6, and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.
|
|
|
<div align='center'> |
|
<img src="https://github.com/baaivision/Emu3/blob/main//assets/comparison.png?raw=True" class="interpolation-image" alt="comparison." height="80%" width="80%" /> |
|
</div> |
|
|
|
### Highlights |
|
|
|
- **Emu3** generates high-quality images from text input simply by predicting the next vision token, and naturally supports flexible resolutions and styles (see the sketch after this list).

- **Emu3** shows strong vision-language understanding of the physical world and provides coherent text responses. Notably, this capability is achieved without relying on CLIP or a pretrained LLM.

- **Emu3** generates a video causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. Given a video as context, Emu3 can also naturally extend it and predict what will happen next.
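As a taste of the generation workflow, here is a hedged sketch of text-to-image generation with the `BAAI/Emu3-Gen` checkpoint. The `transformers` calls below are standard, but the prompt format and sampling settings are assumptions; consult the GitHub repo for the exact recipe.

```python
# Hedged sketch of text-to-image generation; see the GitHub repo for the exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3-Gen", trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu3-Gen", trust_remote_code=True)

prompt = "a portrait of a young girl"  # illustrative prompt; the repo uses a specific template
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # Emit discrete vision tokens autoregressively after the text prompt.
    output_ids = model.generate(input_ids, max_new_tokens=8192, do_sample=True)

# The ids generated after the prompt are discrete vision tokens; mapping them back
# to a code grid and decoding to pixels goes through Emu3-VisionTokenizer's decode,
# as in the autoencoding quickstart below. The exact steps are in the GitHub repo.
```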
|
|
|
### Quickstart for Autoencoding |
|
```python
import os
import os.path as osp

from PIL import Image
import torch
from transformers import AutoModel, AutoImageProcessor

MODEL_HUB = "BAAI/Emu3-VisionTokenizer"

model = AutoModel.from_pretrained(MODEL_HUB, trust_remote_code=True).eval().cuda()
processor = AutoImageProcessor.from_pretrained(MODEL_HUB, trust_remote_code=True)

# TODO: set this to a directory of video frames stored as individual image files
VIDEO_FRAMES_PATH = "YOUR_VIDEO_FRAMES_PATH"

# Load the frames in order; sorting assumes filenames encode the frame order.
video = os.listdir(VIDEO_FRAMES_PATH)
video.sort()
video = [Image.open(osp.join(VIDEO_FRAMES_PATH, v)) for v in video]

# Preprocess to a (1, T, C, H, W) tensor of pixel values.
images = processor(video, return_tensors="pt")["pixel_values"]
images = images.unsqueeze(0).cuda()

# Image autoencode: encode a single frame to discrete codes, then decode it back.
image = images[:, 0]
print(image.shape)
with torch.no_grad():
    codes = model.encode(image)
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_image = processor.postprocess(recon)["pixel_values"][0]
recon_image.save("recon_image.png")

# Video autoencode: regroup the frames into temporal chunks of
# `temporal_downsample_factor` frames, as expected by the video encoder.
# The total number of frames must be divisible by this factor.
images = images.view(
    -1,
    model.config.temporal_downsample_factor,
    *images.shape[2:],
)

print(images.shape)
with torch.no_grad():
    codes = model.encode(images)
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_images = processor.postprocess(recon)["pixel_values"]
for idx, im in enumerate(recon_images):
    im.save(f"recon_video_{idx}.png")
```
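
To sanity-check the autoencoding round trip, you can compare a source frame with its reconstruction. A minimal sketch follows; it is not part of the original quickstart, assumes it runs in the same Python session as the code above (so `video` is in scope and `recon_image.png` exists), and resizes the reconstruction in case the processor changed the frame size.

```python
# Optional sanity check: PSNR between the first source frame and its reconstruction.
import numpy as np
from PIL import Image

orig = video[0].convert("RGB")
recon = Image.open("recon_image.png").convert("RGB").resize(orig.size)

a = np.asarray(orig, dtype=np.float32)
b = np.asarray(recon, dtype=np.float32)
mse = ((a - b) ** 2).mean()
psnr = 20 * np.log10(255.0 / np.sqrt(mse))
print(f"reconstruction PSNR: {psnr:.2f} dB")
```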
|
|