# PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation
[Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)<sup>1\*</sup>, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)<sup>2\*</sup>, [Kun Wang]()<sup>3</sup>, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)<sup>1,4</sup>, [Hao Tian]()<sup>3</sup>, [Xingyu Zeng]()<sup>3</sup>, [Rui Zhao]()<sup>3</sup>, [Jifeng Dai](https://jifengdai.org/)<sup>4,5</sup>, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)<sup>1</sup> :envelope:, [Xihui Liu](https://xh-liu.github.io/)<sup>2</sup> :envelope:

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University

\*Equal contribution, :envelope: Corresponding authors
## Environment Setup

```
conda create -n puma python=3.8
conda activate puma
pip install -r requirements.txt
```

## Checkpoint Download

```
# First replace the placeholder token in download_ckpt.py with your Hugging Face token
python download_ckpt.py
```

For manual downloads, please download the checkpoints from [here](https://huggingface.co./LucasFang/PUMA) and put them under **./ckpts**; a scripted alternative is sketched below.

## Multi-granular Visual Decoding

```
python infer_detokenizer.py --num_tokens <NUM_TOKENS>
```
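If you prefer to script the manual checkpoint download, here is a minimal sketch using `huggingface_hub`; the token handling shown is an assumption, so adapt it to your setup.

```python
# Minimal sketch: pull the PUMA checkpoints into ./ckpts with huggingface_hub
# (ships as a dependency of most recent transformers/diffusers installs).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LucasFang/PUMA",  # checkpoint repo linked above
    local_dir="./ckpts",       # layout the README expects
    token="hf_...",            # replace with your Hugging Face token
)
```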
## Abstract

> **PUMA** introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels at a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for a wide range of visual applications.

Read the full paper [here](https://arxiv.org/abs/2410.13861).

## Framework

- PUMA leverages multi-granular visual representations as unified inputs and outputs for the MLLM, allowing it to handle a wide range of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding; one way to form such multi-granular features is sketched below.
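To make this concrete, below is a minimal, hypothetical sketch of producing multi-granular features by progressively pooling an image encoder's patch-token grid. The encoder choice, tensor shapes, and token counts are illustrative assumptions, not the repo's exact configuration.

```python
import torch
import torch.nn.functional as F

def multi_granular_features(tokens: torch.Tensor) -> list[torch.Tensor]:
    """Pool a (B, N, C) patch-token grid into five granularities f0..f4.

    Illustrative only: PUMA's actual encoder and pooling setup may differ.
    """
    B, N, C = tokens.shape
    side = int(N ** 0.5)                     # assume a square patch grid
    grid = tokens.transpose(1, 2).reshape(B, C, side, side)
    feats = []
    for s in (1, 2, 4, 8, 16):               # 1, 4, 16, 64, 256 tokens
        pooled = F.adaptive_avg_pool2d(grid, s)            # (B, C, s, s)
        feats.append(pooled.flatten(2).transpose(1, 2))    # (B, s*s, C)
    return feats                             # coarse f0 ... fine f4
```

Coarse features keep high-level semantics and leave room for diverse generation, while fine features preserve the detail needed for faithful reconstruction.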
## Multi-granular Semantic Visual Decoding

- PUMA's visual decoding process spans five granular image representations (f0 to f4) and five corresponding diffusion decoders (D0 to D4) trained based on SDXL. This allows PUMA to achieve both precise image reconstruction and semantic-guided generation, supporting control as well as diversity in image generation; a hypothetical decoding interface is sketched after this paragraph.
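As a rough illustration of how granularity trades control for diversity, here is a hypothetical decoder interface. The `Detokenizer` class, its signature, and the token counts are inventions for this sketch, not the released code's API.

```python
import torch
from torch import nn

class Detokenizer(nn.Module):
    """Stand-in for one of the decoders D0..D4 (diffusion models trained
    based on SDXL in the paper). Hypothetical interface for illustration."""

    def __init__(self, num_tokens: int):
        super().__init__()
        self.num_tokens = num_tokens  # granularity this decoder accepts

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, num_tokens, C) -> image (B, 3, H, W); the actual
        # diffusion sampling loop is omitted from this sketch.
        raise NotImplementedError

# One decoder per granularity level, coarse f0 ... fine f4.
decoders = {n: Detokenizer(n) for n in (1, 4, 16, 64, 256)}

# Fine features (f4, 256 tokens) pin the output down for reconstruction
# and editing; coarse features (f0, 1 token) admit diverse generations.
```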
## Diverse Text-to-image Generation

## Image Editing

## Image Conditional Generation

## Citation

If you find PUMA useful in your research, please consider citing us:

```
@article{fang2024puma,
  title={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author={Fang, Rongyao and Duan, Chengqi and Wang, Kun and Li, Hao and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2410.13861},
  year={2024}
}
```

## License

This project is released under the [Apache 2.0 license](LICENSE).