mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Abstract
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited availability of labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment ensures that the different modalities are semantically consistent with one another. Third, high fidelity ensures that the synthetic data retains realistic details, enhancing its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages; (2) are generated via a deep thinking process within a single pass of a multimodal large language model; and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train mmE5, a multimodal multilingual E5 model. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.
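The abstract states that each synthetic example is generated, self-evaluated, and refined within a single pass of a multimodal LLM over a real image. The snippet below is a minimal, hypothetical Python sketch of what such a single-pass synthesis call could look like; the callable `call_mllm`, the `SynthesisExample` fields, and the prompt wording are illustrative assumptions only and do not reflect the paper's actual prompts or pipeline (see the linked repository for the released code).

```python
# Hypothetical sketch: single-pass synthesis with built-in self-evaluation
# and refinement. All names here are illustrative, not the paper's API.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class SynthesisExample:
    task: str            # e.g. "visual question answering", "image retrieval"
    language: str        # target language of the synthesized text
    query: str           # query paired with the real image
    positive: str        # relevant (aligned) text for the image
    hard_negative: str   # plausible but incorrect text

PROMPT_TEMPLATE = """You are given a real image (attached) for the task: {task}.
Think step by step about the image content, then write a {language} query,
a relevant positive passage, and a hard negative passage.
Then evaluate your own output for cross-modal alignment and fidelity,
refine it if needed, and return only the final JSON with keys
"query", "positive", "hard_negative"."""

def synthesize(call_mllm: Callable[[str, bytes], str],
               image: bytes, task: str, language: str) -> SynthesisExample:
    """One MLLM call: generation, self-evaluation, and refinement all
    happen inside a single response, as the abstract describes."""
    raw = call_mllm(PROMPT_TEMPLATE.format(task=task, language=language), image)
    fields = json.loads(raw)
    return SynthesisExample(task=task, language=language,
                            query=fields["query"],
                            positive=fields["positive"],
                            hard_negative=fields["hard_negative"])
```

Because the evaluation and refinement instructions are folded into the same prompt, only the final, already-refined example needs to be parsed from the model's single response.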
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MLLM4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLMs (2025)
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024)
- SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning (2025)
- MINIMA: Modality Invariant Image Matching (2024)
- Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks (2025)
- Asymmetric Reinforcing against Multi-modal Representation Bias (2025)
- Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding (2024)