hidehisa-arai committed
Commit: e1c09c5
Parent(s): 82f1345
update README.md

README.md CHANGED
---
license: mit
language:
- ja
pipeline_tag: feature-extraction
tags:
- clip
- japanese-clip
---

# recruit-jp/japanese-clip-vit-b-32-roberta-base

## Overview

* **Developed by**: [Recruit Co., Ltd.](https://huggingface.co/recruit-jp)
* **Model type**: Contrastive Language-Image Pretrained Model
* **Language(s)**: Japanese
* **License**: MIT

More details are described in our tech blog post:
* [日本語CLIP学習済みモデルとその評価用データセットの公開 (Releasing a pretrained Japanese CLIP model and its evaluation datasets)](https://blog.recruit.co.jp/data/articles/japanese-clip/)

## Model Details

This model is a Japanese [CLIP](https://arxiv.org/abs/2103.00020) model: it maps Japanese texts and images into the same embedding space.
You can use it for tasks such as zero-shot image classification, text-image retrieval, and image feature extraction.

This model uses the image encoder of [laion/CLIP-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K) and [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base) as the text encoder.
It is trained on the Japanese subset of the [LAION2B-multi dataset](https://huggingface.co/datasets/laion/laion2B-multi) and is tailored to the Japanese language.

## How to use

1. Install packages

```shell
pip install pillow requests transformers torch torchvision sentencepiece
```

2. Run the code below

```python
import io
# ... (the middle of the example is omitted in this diff)
probs = image_features @ text_features.T

print("Label probs:", probs.cpu().numpy()[0])
```
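
The middle of the example is not shown in this hunk, so here is a minimal, self-contained sketch of what the final lines compute. The feature tensors below are random stand-ins with an assumed embedding size, not outputs of this model: with L2-normalized features, `image_features @ text_features.T` is the cosine similarity between the image and each candidate label text, which is what the example prints as "Label probs".

```python
# Illustrative sketch only: random stand-in features, not real encoder outputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512                                                   # assumed embedding size
image_features = F.normalize(torch.randn(1, embed_dim), dim=-1)   # one image embedding
text_features = F.normalize(torch.randn(3, embed_dim), dim=-1)    # three candidate label texts

# Cosine similarity of the image to each label text (printed as "Label probs" above).
probs = image_features @ text_features.T
print("Label probs:", probs.cpu().numpy()[0])

# If calibrated class probabilities are preferred, a softmax over the similarities is common.
print("Softmax over similarities:", probs.softmax(dim=-1).cpu().numpy()[0])
```

In the full example, `image_features` and `text_features` are produced by this model's image and text encoders.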

## Model Performance

We evaluated the model on the datasets listed below.
Since ImageNet V2 and Food101 are datasets from an English-speaking context, we translated their class labels into Japanese before evaluation.

* [ImageNet V2](https://github.com/modestyachts/ImageNetV2_pytorch) test set (Top-1 Accuracy)
* [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) (Top-1 Accuracy)
* [Hiragana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
* [Katakana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
* [STAIR Captions](http://captions.stair.center/) Image-to-Text Retrieval (Average of Precision@1,5,10)
* [STAIR Captions](http://captions.stair.center/) Text-to-Image Retrieval (Average of Precision@1,5,10)
* [jafood101](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafood101.csv) (Top-1 Accuracy)
* [jaflower30](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jaflower30.csv) (Top-1 Accuracy)
* [jafacility20](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafacility20.csv) (Top-1 Accuracy)
* [jalandmark10](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jalandmark10.csv) (Top-1 Accuracy)

We also evaluated [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k), [laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k), [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16), and [stabilityai/japanese-stable-clip-vit-l-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) on the same datasets.
Note that since stabilityai/japanese-stable-clip-vit-l-16 is trained on the STAIR Captions dataset, we skipped its evaluation on STAIR Captions.

| **Model** | **ImageNet V2** | **Food101** | **ETLC-hiragana** | **ETLC-katakana** | **STAIR Captions image-to-text** | **STAIR Captions text-to-image** | **jafood101** | **jaflower30** | **jafacility20** | **jalandmark10** |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k|**0.471**|**0.742**|0.055|0.029|**0.462**|**0.223**|**0.709**|**0.869**|**0.820**|**0.899**|
|laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k|0.326|0.508|**0.162**|**0.061**|0.372|0.169|0.609|0.709|0.749|0.846|
|rinna/japanese-clip-vit-b-16|0.435|0.491|0.014|0.024|0.089|0.034|0.308|0.592|0.406|0.656|
|stabilityai/japanese-stable-clip-vit-l-16|0.481|0.460|0.013|0.023|-|-|0.413|0.689|0.677|0.752|
|recruit-jp/japanese-clip-vit-b-32-roberta-base|0.175|0.301|0.030|0.038|0.191|0.102|0.524|0.592|0.676|0.797|
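
For reference, the STAIR Captions columns above report the average of Precision@1, 5, and 10 over retrieval queries. The snippet below is a rough sketch of how such a score can be computed from a query-by-candidate similarity matrix together with a 0/1 relevance matrix (e.g., an image's own captions for image-to-text retrieval); it is an illustration under these assumptions, not the evaluation code used for this table.

```python
# Sketch of "average of Precision@1,5,10" for retrieval; not the authors' evaluation code.
import torch

def precision_at_k(sim: torch.Tensor, relevant: torch.Tensor, k: int) -> float:
    """sim: (queries, candidates) similarities; relevant: 0/1 matrix of correct pairs."""
    topk = sim.topk(k, dim=-1).indices   # indices of the k most similar candidates per query
    hits = relevant.gather(1, topk)      # 1 where a retrieved candidate is actually relevant
    return hits.float().mean().item()    # mean fraction of relevant items in the top k

def avg_precision_1_5_10(sim: torch.Tensor, relevant: torch.Tensor) -> float:
    return sum(precision_at_k(sim, relevant, k) for k in (1, 5, 10)) / 3

# Toy example: 2 queries, 20 candidates, a few relevant candidates per query.
sim = torch.randn(2, 20)
relevant = torch.zeros(2, 20)
relevant[0, [3, 5]] = 1.0
relevant[1, [7]] = 1.0
print(avg_precision_1_5_10(sim, relevant))
```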

## Training Dataset

This model is trained on 128M image-text pairs from the Japanese subset of the [LAION2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset.