# PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation
[Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)<sup>1\*</sup>, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)<sup>2\*</sup>, [Kun Wang]()<sup>3</sup>, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)<sup>1,4</sup>, [Hao Tian]()<sup>3</sup>, [Xingyu Zeng]()<sup>3</sup>, [Rui Zhao]()<sup>3</sup>, [Jifeng Dai](https://jifengdai.org/)<sup>4,5</sup>, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)<sup>1</sup> :envelope:, [Xihui Liu](https://xh-liu.github.io/)<sup>2</sup> :envelope:

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University

\*Equal contribution, :envelope: Corresponding authors
## Environment Setup

```
conda create -n puma python=3.8
conda activate puma
pip install -r requirements.txt
```

## Checkpoint Download

```
# First replace the placeholder token in download_ckpt.py with your Hugging Face token
python download_ckpt.py
```

For manual downloads, please download the checkpoints from [here](https://huggingface.co./LucasFang/PUMA) and put them under **./ckpts**; a scripted alternative is sketched below.

## Multi-granular Visual Decoding

```
python infer_detokenizer.py --num_tokens <NUM_TOKENS>
```
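If you prefer to script the manual checkpoint download, here is a minimal sketch using `huggingface_hub`; the token handling shown is an assumption, so adapt it to your setup.

```python
# Minimal sketch: pull the PUMA checkpoints into ./ckpts with huggingface_hub
# (ships as a dependency of most recent transformers/diffusers installs).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LucasFang/PUMA",  # checkpoint repo linked above
    local_dir="./ckpts",       # layout the README expects
    token="hf_...",            # replace with your Hugging Face token
)
```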
## Abstract

> **PUMA** introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels at a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for a wide range of visual applications.

Read the full paper [here](https://arxiv.org/abs/2410.13861).

## Framework

- PUMA leverages multi-granular visual representations as unified inputs and outputs for the MLLM, allowing it to handle a wide range of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding; one way to form such multi-granular features is sketched below.
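To make this concrete, below is a minimal, hypothetical sketch of producing multi-granular features by progressively pooling an image encoder's patch-token grid. The encoder choice, tensor shapes, and token counts are illustrative assumptions, not the repo's exact configuration.

```python
import torch
import torch.nn.functional as F

def multi_granular_features(tokens: torch.Tensor) -> list[torch.Tensor]:
    """Pool a (B, N, C) patch-token grid into five granularities f0..f4.

    Illustrative only: PUMA's actual encoder and pooling setup may differ.
    """
    B, N, C = tokens.shape
    side = int(N ** 0.5)                     # assume a square patch grid
    grid = tokens.transpose(1, 2).reshape(B, C, side, side)
    feats = []
    for s in (1, 2, 4, 8, 16):               # 1, 4, 16, 64, 256 tokens
        pooled = F.adaptive_avg_pool2d(grid, s)            # (B, C, s, s)
        feats.append(pooled.flatten(2).transpose(1, 2))    # (B, s*s, C)
    return feats                             # coarse f0 ... fine f4
```

Coarse features keep high-level semantics and leave room for diverse generation, while fine features preserve the detail needed for faithful reconstruction.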
## Multi-granular Semantic Visual Decoding

- PUMA's visual decoding process spans five granular image representations (f0 to f4) and five corresponding diffusion decoders (D0 to D4) trained based on SDXL. This allows PUMA to achieve both precise image reconstruction and semantic-guided generation, supporting control as well as diversity in image generation; a hypothetical decoding interface is sketched after this paragraph.
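As a rough illustration of how granularity trades control for diversity, here is a hypothetical decoder interface. The `Detokenizer` class, its signature, and the token counts are inventions for this sketch, not the released code's API.

```python
import torch
from torch import nn

class Detokenizer(nn.Module):
    """Stand-in for one of the decoders D0..D4 (diffusion models trained
    based on SDXL in the paper). Hypothetical interface for illustration."""

    def __init__(self, num_tokens: int):
        super().__init__()
        self.num_tokens = num_tokens  # granularity this decoder accepts

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, num_tokens, C) -> image (B, 3, H, W); the actual
        # diffusion sampling loop is omitted from this sketch.
        raise NotImplementedError

# One decoder per granularity level, coarse f0 ... fine f4.
decoders = {n: Detokenizer(n) for n in (1, 4, 16, 64, 256)}

# Fine features (f4, 256 tokens) pin the output down for reconstruction
# and editing; coarse features (f0, 1 token) admit diverse generations.
```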
## Diverse Text-to-image Generation

## Image Editing

## Image Conditional Generation

## Citation

If you find PUMA useful in your research, please consider citing us:

```
@article{fang2024puma,
  title={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author={Fang, Rongyao and Duan, Chengqi and Wang, Kun and Li, Hao and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2410.13861},
  year={2024}
}
```

## License

This project is released under the [Apache 2.0 license](LICENSE).