# PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation
- PUMA leverages multi-granular visual representations as unified inputs and outputs for the MLLM, enabling it to handle a variety of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.

## Multi-granular Semantic Visual Decoding
- PUMA's visual decoding process spans five granular image representations (f0 to f4) and corresponding decoders (D0 to D4), which are trained using SDXL. This allows PUMA to achieve both precise image reconstruction and semantic-guided generation, supporting controllability as well as diversity in image generation tasks.

## Diverse Text-to-image Generation
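As a rough, hedged illustration of the idea above (not PUMA's actual code), a multi-granular pipeline can be sketched as a stack of progressively coarser features f0 (finest) to f4 (coarsest), each paired with its own decoder. All function names, shapes, and the pooling/upsampling choices below are illustrative assumptions; the real model uses learned SDXL-based decoders.

```python
import numpy as np

def encode_multigranular(image, levels=5):
    """Toy multi-granular encoder: produce features f0 (finest) .. f4 (coarsest)
    by repeated 2x average pooling. Shapes are illustrative, not PUMA's."""
    feats = [image]
    for _ in range(levels - 1):
        h, w = feats[-1].shape
        pooled = feats[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        feats.append(pooled)
    return feats  # [f0, f1, f2, f3, f4]

def decode(feature, target_shape):
    """Toy decoder D_i: nearest-neighbor upsampling back to image resolution,
    standing in for the learned decoders described above."""
    reps = (target_shape[0] // feature.shape[0], target_shape[1] // feature.shape[1])
    return np.kron(feature, np.ones(reps))

image = np.arange(32 * 32, dtype=float).reshape(32, 32)
features = encode_multigranular(image)            # f0: 32x32 ... f4: 2x2
reconstructions = [decode(f, image.shape) for f in features]
# f0 reconstructs the image exactly; coarser levels trade fidelity for abstraction,
# mirroring the control-vs-diversity trade-off across granularities.
```

The intent of the sketch is only the trade-off: decoding from fine features yields faithful reconstruction (control), while decoding from coarse, more semantic features leaves room for varied generations (diversity).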
## Image Editing
## Image Conditional Generation
## Citation

If you find PUMA useful in your research, please consider citing us:

```
@article{fang2024puma,
  title={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author={Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu},
  journal={arXiv preprint},
  year={2024}
}
```

## License

This project is released under the [Apache 2.0 license](LICENSE).