docs: add multilinguality to findings section

intro.md (changed)

We present three demos, which each illustrate different use cases of KoCLIP.

* *Text to Image*: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the given text (a minimal sketch of this lookup appears after the list).
* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.

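The text-to-image demo boils down to a nearest-neighbor lookup over the pre-computed image embeddings. The snippet below is a minimal sketch of that lookup step only, assuming the embeddings have already been computed elsewhere; the array names and the 512-dimensional embedding size are illustrative assumptions, not the demo's actual code.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(text_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k images whose embeddings best match the text query."""
    scores = l2_normalize(image_embs) @ l2_normalize(text_emb)  # shape: (num_images,)
    return np.argsort(-scores)[:top_k]                          # most similar first

# Toy usage with random stand-in embeddings in place of real KoCLIP outputs.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 512))  # pre-computed image database
text_emb = rng.normal(size=(512,))         # embedding of the (Korean) text query
print(retrieve(text_emb, image_embs))
```

The text-to-patch demo could reuse the same scoring by treating each patch embedding as one row of `image_embs` and ranking patches instead of whole images.
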
## Findings

In this section, we detail some interesting findings we made throughout the project.

### Prompting

We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting the query into a full-sentence, caption-like template noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.

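As a concrete illustration of the prompting trick, the sketch below wraps each candidate label in a full-sentence template before scoring it against an image. The Korean template and labels are made-up placeholders (the template the project actually used is not reproduced here), and the embeddings are random stand-ins for the encoders' outputs.

```python
import numpy as np

labels = ["개", "고양이", "자동차"]        # hypothetical candidate labels
template = "이것은 {}의 사진입니다."       # placeholder full-sentence prompt, not the project's template

prompts = [template.format(label) for label in labels]

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the prompt text embeddings and the image embedding; in
# practice these would come from the model's text and image encoders.
rng = np.random.default_rng(0)
text_embs = l2_normalize(rng.normal(size=(len(prompts), 512)))
image_emb = l2_normalize(rng.normal(size=(512,)))

logits = text_embs @ image_emb                 # cosine similarity per prompted label
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the candidate labels
print({label: round(float(p), 3) for label, p in zip(labels, probs)})
```

Scoring the bare `labels` instead of `prompts` gives the un-prompted baseline that the observation above compares against.
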
### Multilinguality

Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog"). This could be due to one of two reasons, or a combination thereof:

* *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English image captioning dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is the fact that `koclip-large` also demonstrates limited multilingual behavior.

* *LM Knowledge Bleed*: `klue/roberta-large` was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both `koclip-base` and `koclip-large`. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

## Future Work

Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of research, we hope to explore more scientific approaches to the topic.

## References

```bibtex
@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

```bibtex
@misc{radford2021learning,
    title={Learning Transferable Visual Models From Natural Language Supervision},
    author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
    year={2021},
    eprint={2103.00020},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

---