---
pipeline_tag: visual-question-answering
---

## MiniCPM-V

**MiniCPM-V** is an efficient multimodal model with promising performance, designed for deployment. It is built on MiniCPM-2.4B and SigLip-400M, connected by a perceiver resampler.

Notable features of MiniCPM-V include:

- 🚀 **High Efficiency.** MiniCPM-V can be **efficiently deployed on most GPU cards and personal computers**, and **even on edge devices such as mobile phones**. For visual encoding, we compress the image representations into 64 tokens via a perceiver resampler, significantly fewer than in other LMMs based on an MLP projector (typically > 512 tokens). This allows MiniCPM-V to run with **much lower memory cost and higher inference speed**.
- 🔥 **Promising Performance.** MiniCPM-V achieves **state-of-the-art performance** on multiple benchmarks (including MMMU, MME, and MMBench) among models of comparable size, surpassing existing LMMs built on Phi-2. It even **achieves comparable or better performance than the 9.6B Qwen-VL-Chat**.
- 🙌 **Bilingual Support.** MiniCPM-V is **the first edge-deployable LMM supporting bilingual multimodal interaction in English and Chinese**. This is achieved by generalizing multimodal capabilities across languages, a technique from our ICLR 2024 spotlight [paper](https://arxiv.org/abs/2308.12038).

## Evaluation
| Model | Size | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|---|---|---|---|---|---|---|
| LLaVA-Phi | 3.0B | 1335 | 59.8 | - | - | - |
| MobileVLM | 3.0B | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| MiniCPM-V | 3B | 1452 | 67.3 | 61.9 | 34.7 | 32.1 |
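To make the token-compression idea concrete, here is a toy NumPy sketch of the cross-attention step at the heart of a perceiver resampler: a fixed set of 64 learned latent queries attends over the vision encoder's patch tokens, so the output length is always 64 no matter how many patches come in. The dimensions and the single-head, single-layer simplification are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32            # embedding dim (toy value; the real model is much larger)
n_patches = 1024  # visual patch tokens produced by the vision encoder
n_latents = 64    # fixed number of output tokens, as in MiniCPM-V

# Learned latent queries (random here for illustration) and patch tokens.
latents = rng.normal(size=(n_latents, d))
patches = rng.normal(size=(n_patches, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(latents, patches):
    # Single-head cross-attention: latents are the queries, patches are
    # keys and values, so the output has n_latents rows regardless of
    # how many patch tokens the image produced.
    scores = latents @ patches.T / np.sqrt(d)  # (64, 1024)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return attn @ patches                      # (64, d)

out = resample(latents, patches)
print(out.shape)  # (64, 32): 1024 patch tokens compressed to 64
```

Because the LLM then sees only these 64 tokens instead of 512+ per image, the KV cache and attention cost for the visual part of the context shrink accordingly, which is the source of the memory and speed claims above.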