arXiv:2404.05674

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Published on Apr 8, 2024 · Submitted by akhaliq on Apr 9, 2024
Abstract

In this paper, we present MoMA: an open-vocabulary, training-free personalized image model with flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively combines reference-image and text-prompt information to produce image features that condition an image diffusion model. To better leverage these features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to the diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free, plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity preservation, and prompt faithfulness. Our work is open-source, providing universal access to these advancements.
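At a high level, the "self-attention shortcut" described in the abstract amounts to an extra attention branch in which the diffusion model's tokens also attend to reference-image features produced by the MLLM. The PyTorch sketch below illustrates that general idea only; the class and parameter names (SelfAttentionWithImageShortcut, shortcut_scale, the projection layers) are hypothetical and are not taken from MoMA's open-source release, which should be consulted for the actual mechanism.

```python
# Minimal, illustrative sketch: a self-attention block with an added "shortcut"
# branch that attends to reference-image features. Shapes and names are
# assumptions for illustration, not MoMA's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionWithImageShortcut(nn.Module):
    """Self-attention that additionally attends to reference-image features."""

    def __init__(self, dim: int, num_heads: int = 8, shortcut_scale: float = 1.0):
        super().__init__()
        self.num_heads = num_heads
        self.shortcut_scale = shortcut_scale
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Separate key/value projections for the MLLM-produced reference features.
        self.to_k_ref = nn.Linear(dim, dim, bias=False)
        self.to_v_ref = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        # Split heads, run scaled dot-product attention, then merge heads back.
        b, n, _ = q.shape
        h = self.num_heads
        q, k, v = (t.reshape(b, -1, h, t.shape[-1] // h).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, hidden_states, ref_features):
        # hidden_states: (B, N, C) diffusion-model tokens
        # ref_features:  (B, M, C) reference-image features from the MLLM
        q = self.to_q(hidden_states)
        # Ordinary self-attention over the model's own tokens.
        base = self._attend(q, self.to_k(hidden_states), self.to_v(hidden_states))
        # Shortcut branch: the same queries attend to the reference features,
        # and the result is added with a tunable weight.
        shortcut = self._attend(q, self.to_k_ref(ref_features),
                                self.to_v_ref(ref_features))
        return self.to_out(base + self.shortcut_scale * shortcut)


# Example usage with made-up sizes: 4096 latent tokens, 77 reference tokens.
block = SelfAttentionWithImageShortcut(dim=320)
out = block(torch.randn(1, 4096, 320), torch.randn(1, 77, 320))
```

Because the shortcut is an additive branch with its own projections, a scheme like this can in principle be bolted onto a pretrained diffusion model without retraining its original attention weights, which is consistent with the abstract's description of a tuning-free, plug-and-play module.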


Models citing this paper: 1

Datasets citing this paper: 0

Spaces citing this paper: 2

Collections including this paper: 7