Compatibility of SigLIP 2 Embeddings for Text Generation with PaliGemma 2
Hello,
I’m working on a test application and would like to know if it’s feasible to use embeddings generated by SigLIP 2 (e.g., google/siglip2-so400m-patch16-naflex) to generate text with PaliGemma 2 (e.g., google/paligemma2-3b-ft-docci-448). Specifically, I’m precomputing patch-level embeddings (e.g., [1, 784, 1152]) from SigLIP 2 and wondering if they can be fed into PaliGemma 2 to produce text descriptions, bypassing its default vision encoder. Is this approach supported out-of-the-box with the Hugging Face implementation, or would it require modifications (e.g., to the projection layer or model architecture)? Any advice on making this work would be greatly appreciated!
Hi @zfjerome1,
It is possible to use SigLIP 2 embeddings with PaliGemma 2, but it is not straightforward and isn't supported out of the box. The steps are roughly the following (a rough code sketch follows the list):
- Extract patch embeddings from SigLIP 2 (shape [1, 784, 1152]).
- Create a projection layer that maps SigLIP 2's embedding space to the one PaliGemma expects.
- Modify PaliGemma's forward path to accept these external embeddings instead of running its own vision encoder.
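A minimal sketch of how these pieces could be wired together is below. It is untested: it assumes you already have the SigLIP 2 patch embeddings from your own pipeline, `ExternalVisionTower` is just an illustrative name, and the exact attribute path to the vision tower (`model.vision_tower` vs `model.model.vision_tower`) can differ between transformers versions.

```python
# Untested sketch. `siglip_patch_embeds` stands in for embeddings you computed
# with google/siglip2-so400m-patch16-naflex; `ExternalVisionTower` is just an
# illustrative name for the stand-in module.
import torch
import torch.nn as nn
from transformers import PaliGemmaForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

model_id = "google/paligemma2-3b-ft-docci-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Step 1: externally computed SigLIP 2 patch embeddings (placeholder tensor here).
siglip_patch_embeds = torch.randn(1, 784, 1152)

# Step 2: projection from SigLIP 2's embedding space to the space PaliGemma's
# own vision tower produces. Both are 1152-dim for these checkpoints, but the
# layer still has to be fitted -- a random projection will not give useful captions.
vision_hidden = model.config.vision_config.hidden_size
projection = nn.Linear(siglip_patch_embeds.shape[-1], vision_hidden)

# Step 3: a stand-in vision tower that ignores pixel_values and returns the
# projected external embeddings in the output format PaliGemma reads
# (an object exposing .last_hidden_state). PaliGemma's own multi-modal
# projector still runs on top of this output.
class ExternalVisionTower(nn.Module):
    def __init__(self, embeds, projection):
        super().__init__()
        self.embeds = embeds
        self.projection = projection

    def forward(self, pixel_values=None, **kwargs):
        # The sequence length must match the number of <image> tokens the
        # processor inserts into the prompt (1024 for the 448-px checkpoint).
        return BaseModelOutput(last_hidden_state=self.projection(self.embeds))

model.vision_tower = ExternalVisionTower(siglip_patch_embeds, projection)
```

After swapping in the wrapper, the usual processor call followed by model.generate should run, as long as the number of patch tokens you return matches the number of <image> placeholders the processor inserts, and the projection layer has actually been fitted on paired embeddings rather than left at random initialization.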
Thank you.
Hi @GopiUppari, thank you kindly for the guidance. With the help of Sonnet 3.7, I made some progress that seems promising. Could you review the following and let me know whether this is the right approach?
The approach used a "semantic mapping" strategy with three key components (a condensed sketch follows the list):
- Spatial Adaptation: Transformed SigLIP 2's 27×27 grid (729 tokens) to PaliGemma's expected 32×32 grid (1024 tokens) using bilinear interpolation, preserving spatial relationships.
- Semantic Transformation: Created a linear projection matrix using torch.linalg.lstsq that maps SigLIP 2's embedding values to closely match those produced by PaliGemma's original vision tower.
- Custom Vision Tower: Implemented a drop-in replacement that:
  - Processes images with SigLIP 2
  - Applies the spatial resizing
  - Transforms the embedding values with the learned projection
  - Returns embeddings in the format PaliGemma expects
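Roughly, the resize-plus-projection step of the custom tower looks like this (a condensed, untested sketch rather than my exact code; `W` is the 1152×1152 matrix fitted with torch.linalg.lstsq as in the next snippet, and the grid sizes assume the 729-token SigLIP 2 output and the 1024 image tokens the 448-px PaliGemma checkpoint expects):

```python
import torch
import torch.nn.functional as F

def adapt_siglip2_embeddings(siglip_embeds: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Map SigLIP 2 patch embeddings [B, 729, 1152] to PaliGemma-style [B, 1024, 1152]."""
    B, N, D = siglip_embeds.shape
    side = int(N ** 0.5)                                      # 27 for 729 tokens
    # Reshape tokens back onto their 2D grid and resize 27x27 -> 32x32.
    grid = siglip_embeds.transpose(1, 2).reshape(B, D, side, side)
    grid = F.interpolate(grid, size=(32, 32), mode="bilinear", align_corners=False)
    resized = grid.reshape(B, D, 32 * 32).transpose(1, 2)     # [B, 1024, 1152]
    # Semantic projection into the space PaliGemma's vision tower would produce.
    return resized @ W
```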
This works because while both models use the same embedding dimension (1152), they encode visual concepts differently. The least squares solution finds a transformation that makes SigLIP 2's "visual language" more compatible with what PaliGemma expects.
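The projection matrix itself was fitted on a small set of calibration images by pairing the (resized) SigLIP 2 patch embeddings with the ones PaliGemma's original vision tower produces for the same images; in sketch form (X and Y are placeholders for those paired, flattened embeddings):

```python
import torch

# X: resized SigLIP 2 embeddings, Y: PaliGemma vision-tower embeddings,
# both flattened over images and tokens to [num_images * 1024, 1152].
X = torch.randn(50 * 1024, 1152)   # placeholder
Y = torch.randn(50 * 1024, 1152)   # placeholder

# Solve min_W ||X @ W - Y||_F; W maps SigLIP 2's space onto PaliGemma's.
W = torch.linalg.lstsq(X, Y).solution   # [1152, 1152]
```

Even a few dozen images give far more rows than the 1152 columns, so the least-squares problem is comfortably overdetermined; appending a column of ones to X would additionally fit a bias term.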
The success of this approach (the generated description matched the original one exactly) demonstrates that it's possible to use newer vision models with existing multimodal systems through careful adaptation of both the spatial dimensions and the semantic representations.