hidehisa-arai committed
Commit e1c09c5
Parent(s): 82f1345

update README.md

Files changed (1): README.md (+65 -2)

README.md CHANGED
@@ -1,10 +1,42 @@
---
license: mit
+ language:
+ - ja
+ pipeline_tag: feature-extraction
+ tags:
+ - clip
+ - japanese-clip
---

# recruit-jp/japanese-clip-vit-b-32-roberta-base

- ## 使い方
+ ## Overview
+
+ * **Developed by**: [Recruit Co., Ltd.](https://huggingface.co/recruit-jp)
+ * **Model type**: Contrastive Language-Image Pretrained Model
+ * **Language(s)**: Japanese
+ * **License**: MIT
+
+ More details are described in our tech blog post:
+ * [日本語CLIP学習済みモデルとその評価用データセットの公開](https://blog.recruit.co.jp/data/articles/japanese-clip/) (Releasing a pretrained Japanese CLIP model and its evaluation datasets)
+
+ ## Model Details
+
+ This model is a Japanese [CLIP](https://arxiv.org/abs/2103.00020) model. It maps Japanese texts and images into the same embedding space.
+ You can use it for tasks such as zero-shot image classification, text-image retrieval, and image feature extraction.
+
+ This model uses the image encoder of [laion/CLIP-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K) and the text encoder of [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base).
+ It is trained on the Japanese subset of the [LAION2B-multi dataset](https://huggingface.co/datasets/laion/laion2B-multi) and is tailored to Japanese.
+
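Since images and texts share a single embedding space, zero-shot classification and retrieval both reduce to ranking cosine similarities between embeddings. A minimal sketch with placeholder tensors (the embedding dimension and tensor contents below are illustrative, not taken from this model's actual outputs):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for this model's outputs:
# 4 candidate label texts and 100 gallery images, L2-normalised.
text_embeds = F.normalize(torch.randn(4, 512), dim=-1)
image_embeds = F.normalize(torch.randn(100, 512), dim=-1)

# Zero-shot classification: score one image against every label text
# and pick the most similar label.
scores = image_embeds[0] @ text_embeds.T          # shape: (4,)
predicted_label = scores.argmax().item()

# Text-to-image retrieval: rank all gallery images against one label text
# and keep the top-5 matches.
ranking = (text_embeds[0] @ image_embeds.T).topk(5).indices

print(predicted_label, ranking.tolist())
```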
+ ## How to use
+
+ 1. Install packages
+
+ ```shell
+ pip install pillow requests transformers torch torchvision sentencepiece
+ ```
+
+ 2. Run the code below

```python
import io
 
@@ -56,4 +88,35 @@ with torch.inference_mode():
probs = image_features @ text_features.T

print("Label probs:", probs.cpu().numpy()[0])
- ```
+ ```
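The model loading and feature extraction between the two hunks above is not shown in this diff. Working only from the visible tail, where `probs` holds one similarity score per candidate text for the input image, here is a hedged sketch of turning those scores into a zero-shot prediction (the scores and candidate labels below are placeholders, and the elided code may already apply a temperature-scaled softmax):

```python
import numpy as np

# Placeholder similarity scores, one per candidate text, standing in for
# the `probs.cpu().numpy()[0]` row printed above.
label_probs = np.array([0.31, 0.24, 0.18])
candidate_texts = ["犬", "猫", "象"]  # hypothetical candidate labels

# Optionally turn raw similarity scores into a distribution with a softmax ...
softmaxed = np.exp(label_probs) / np.exp(label_probs).sum()

# ... and read off the best-matching text as the zero-shot prediction.
print(candidate_texts[int(label_probs.argmax())], softmaxed)
```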
+
+ ## Model Performance
+
+ We evaluated the model on the datasets listed below.
+ Since ImageNet V2 and Food101 come from an English-speaking context, we translated their class labels into Japanese before running the evaluation.
+
+ * [ImageNet V2](https://github.com/modestyachts/ImageNetV2_pytorch) test set (Top-1 Accuracy)
+ * [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) (Top-1 Accuracy)
+ * [Hiragana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
+ * [Katakana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
+ * [STAIR Captions](http://captions.stair.center/) Image-to-Text Retrieval (Average of Precision@1,5,10)
+ * [STAIR Captions](http://captions.stair.center/) Text-to-Image Retrieval (Average of Precision@1,5,10)
+ * [jafood101](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafood101.csv) (Top-1 Accuracy)
+ * [jaflower30](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jaflower30.csv) (Top-1 Accuracy)
+ * [jafacility20](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafacility20.csv) (Top-1 Accuracy)
+ * [jalandmark10](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jalandmark10.csv) (Top-1 Accuracy)
+
+ We also evaluated [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k), [laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k), [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16), and [stabilityai/japanese-stable-clip-vit-l-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) on the same datasets.
+ Note that since stabilityai/japanese-stable-clip-vit-l-16 is trained on the STAIR Captions dataset, we skipped its evaluation on STAIR Captions.
+
+ | **Model** | **ImageNet V2** | **Food101** | **ETLC-hiragana** | **ETLC-katakana** | **STAIR Captions image-to-text** | **STAIR Captions text-to-image** | **jafood101** | **jaflower30** | **jafacility20** | **jalandmark10** |
+ |:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+ |laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k|**0.471**|**0.742**|0.055|0.029|**0.462**|**0.223**|**0.709**|**0.869**|**0.820**|**0.899**|
+ |laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k|0.326|0.508|**0.162**|**0.061**|0.372|0.169|0.609|0.709|0.749|0.846|
+ |rinna/japanese-clip-vit-b-16|0.435|0.491|0.014|0.024|0.089|0.034|0.308|0.592|0.406|0.656|
+ |stabilityai/japanese-stable-clip-vit-l-16|0.481|0.460|0.013|0.023|-|-|0.413|0.689|0.677|0.752|
+ |recruit-jp/japanese-clip-vit-b-32-roberta-base|0.175|0.301|0.030|0.038|0.191|0.102|0.524|0.592|0.676|0.797|
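For reference, metrics of this kind can be computed from a query-to-candidate similarity matrix roughly as follows. This is only a sketch with random placeholder scores and a simplified one-correct-match-per-query assumption, not the evaluation script behind the numbers above.

```python
import torch

# Placeholder similarity matrix: rows are 50 query embeddings scored against
# 50 candidate embeddings; query i is assumed to match candidate i.
torch.manual_seed(0)
sim = torch.randn(50, 50)
targets = torch.arange(50)

# Top-1 accuracy, as reported for the classification datasets.
top1_acc = (sim.argmax(dim=1) == targets).float().mean().item()

def precision_at_k(sim: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Precision@k when each query has a single correct match.

    STAIR Captions pairs each image with several captions, so the real
    evaluation may count more than one correct match per query.
    """
    topk = sim.topk(k, dim=1).indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1).float()
    return (hits / k).mean().item()

# "Average of Precision@1,5,10", as in the retrieval columns above.
avg_precision = sum(precision_at_k(sim, targets, k) for k in (1, 5, 10)) / 3
print(f"Top-1 accuracy: {top1_acc:.3f}, average Precision@1,5,10: {avg_precision:.3f}")
```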
+
+ ## Training Dataset
+
+ This model was trained on 128M image-text pairs from the Japanese subset of the [LAION2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset.
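As a hedged sketch of how such a subset can be selected with the `datasets` library, assuming the LAION2B-multi metadata exposes `LANGUAGE`, `URL`, and `TEXT` columns with language codes such as `ja` (check the dataset card for the actual schema and access requirements):

```python
from itertools import islice

from datasets import load_dataset

# Stream the LAION2B-multi metadata and keep only rows detected as Japanese.
# Assumption: the language tag lives in a "LANGUAGE" column with values like "ja".
laion = load_dataset("laion/laion2B-multi", split="train", streaming=True)
japanese_subset = laion.filter(lambda row: row["LANGUAGE"] == "ja")

# Peek at a few Japanese caption/URL pairs (the images themselves are fetched from URL).
for row in islice(japanese_subset, 3):
    print(row["URL"], row["TEXT"])
```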