rinna/japanese-cloob-vit-b-16



mkshing committed on May 11, 2022

Commit 0a74ca0 • 1 Parent(s): 16394cb

update README.md

Files changed (1)

README.md +9 -0


@@ -58,4 +58,13 @@ with torch.no_grad():
 print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
 ```

+# Model architecture
+The model uses a ViT-B/16 Transformer architecture as an image encoder and a 12-layer RoBERTa as a text encoder. The text encoder was trained starting from the pre-trained Japanese RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base) and uses the same sentencepiece tokenizer.
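
As a rough illustration of how the two encoders described above are combined at inference time, the sketch below scores one image embedding against several text embeddings using cosine similarity followed by a softmax. It is not taken from the model's code: the small hand-written vectors stand in for the ViT-B/16 and RoBERTa outputs, and the function names and the `scale` value are hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_probs(image_embedding, text_embeddings, scale=100.0):
    """Return softmax probabilities over the text embeddings for one image.

    `scale` is an assumed logit scale, not a value read from the model.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(scale * cosine)

    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder embeddings: in practice these would come from the image
# encoder and the text encoder respectively.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

probs = label_probs(image_embedding, text_embeddings)
print("Label probs:", [round(p, 3) for p in probs])
```

The text embedding closest in direction to the image embedding receives the highest probability, which is the same pattern as the `Label probs` output in the usage example.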
+# Training
+The model was trained on [CC12M](https://github.com/google-research-datasets/conceptual-12m) with the captions translated into Japanese.
+# License
+[Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0)