---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- ja
base_model:
- cl-nagoya/ruri-large
- Qwen/Qwen2-VL-2B-Instruct
license: apache-2.0
---

# SentenceTransformer

This is an experimental model.

For details, see the [blog post](https://note.com/oshizo/n/n473a0124585b); the related source code is in the [repository](https://github.com/oshizo/japanese-clip-qwen2_vl/).

The text embedding model is [cl-nagoya/ruri-large](https://huggingface.co./cl-nagoya/ruri-large/tree/main), and the image encoder is based on the ViT from [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co./Qwen/Qwen2-VL-2B-Instruct).

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
import io

import requests
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oshizo/japanese-clip-qwen2_vl-exp-0101", trust_remote_code=True)

sentences = [
    # "A monochrome portrait photo of a man. He wears a military uniform and sits on stone steps."
    "モノクロの男性の肖像写真。軍服を着て石の階段に座っている。",
    # "A brown dog sits in a garden, facing the camera."
    "庭で茶色の犬がこちらを向いて座っている。",
]
text_embeddings = model.encode(sentences)
text_embeddings.shape
# (2, 1024)

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/7/73/Shigenobu_Okuma_5.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/7/78/Akita_inu.jpeg",
]
images = [
    Image.open(io.BytesIO(requests.get(image_urls[0]).content)).resize((150, 240)),
    Image.open(io.BytesIO(requests.get(image_urls[1]).content)).resize((240, 150)),
]
image_embeddings = model.encode(images)
image_embeddings.shape
# (2, 1024)

similarities = model.similarity(text_embeddings, image_embeddings)
similarities
# tensor([[0.2573, 0.0105],
#         [0.0282, 0.2982]])
```
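The similarity function listed above is cosine similarity, so `model.similarity` is equivalent to normalizing each embedding row to unit length and taking dot products. A minimal NumPy sketch of that computation (the helper name and the toy 2-dimensional vectors are illustrative; real embeddings from this model have shape `(n, 1024)`):

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Normalize each row to unit L2 norm, then a matrix product gives
    # every pairwise dot product, i.e. every cosine similarity.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy stand-ins for text and image embeddings.
text_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
image_embeddings = np.array([[1.0, 1.0], [0.0, 2.0]])

print(cosine_similarity_matrix(text_embeddings, image_embeddings))
```

Because the similarity is cosine, scaling an embedding does not change its scores; only direction matters.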