feature: add intro page, cleanup descriptions
- app.py +4 -2
- image2text.py +12 -6
- intro.md +32 -0
- intro.py +6 -0
- text2image.py +2 -10
- text2patch.py +4 -2
app.py
CHANGED
@@ -1,17 +1,19 @@
 import streamlit as st
 
 import image2text
+import intro
 import text2image
 import text2patch
 
 PAGES = {
+    "Introduction": intro,
     "Text to Image": text2image,
     "Image to Text": image2text,
-    "
+    "Text to Patch": text2patch,
 }
 
 st.sidebar.title("Navigation")
 model = st.sidebar.selectbox("Choose a model", ["koclip-base", "koclip-large"])
-page = st.sidebar.selectbox("
+page = st.sidebar.selectbox("Navigate to...", list(PAGES.keys()))
 
 PAGES[page].app(model)
image2text.py
CHANGED
@@ -14,9 +14,9 @@ def app(model_name):
     st.title("Zero-shot Image Classification")
     st.markdown(
         """
-        This
-
-
+        This demo explores KoCLIP's zero-shot prediction capabilities. The model takes an image and a list of candidate captions from the user and predicts which caption best describes the given image.
+
+        ---
         """
     )
 
@@ -30,6 +30,7 @@ def app(model_name):
 
     with col2:
         captions_count = st.selectbox("Number of labels", options=range(1, 6), index=2)
+        normalize = st.checkbox("Apply Softmax")
         compute = st.button("Classify")
 
     with col1:
@@ -37,7 +38,7 @@ def app(model_name):
         defaults = ["귀여운 고양이", "멋있는 강아지", "포동포동한 햄스터"]
         for idx in range(captions_count):
            value = defaults[idx] if idx < len(defaults) else ""
-            captions.append(st.text_input(f"Insert
+            captions.append(st.text_input(f"Insert caption {idx+1}", value=value))
 
     if compute:
         if not any([query1, query2]):
@@ -61,8 +62,13 @@ def app(model_name):
            inputs["pixel_values"], axes=[0, 2, 3, 1]
        )
        outputs = model(**inputs)
-
-
+        if normalize:
+            name = "normalized prob"
+            probs = jax.nn.softmax(outputs.logits_per_image, axis=1)
+        else:
+            name = "cosine sim"
+            probs = outputs.logits_per_image
+        chart_data = pd.Series(probs[0], index=captions, name=name)
 
        col1, col2 = st.beta_columns(2)
        with col1:
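For reference, the new "Apply Softmax" toggle switches the chart between the model's raw `logits_per_image` (scaled cosine similarities between the image and each caption, labeled "cosine sim" in the app) and a softmax over the caption axis, which turns those scores into a probability distribution. A minimal sketch of that conversion, using stand-in logits rather than real model outputs:

```python
import jax
import jax.numpy as jnp
import numpy as np
import pandas as pd

# Stand-in logits for one image against three candidate captions
# (in the app these come from outputs.logits_per_image).
logits_per_image = jnp.array([[21.3, 18.7, 19.9]])
captions = ["귀여운 고양이", "멋있는 강아지", "포동포동한 햄스터"]

# Raw scores, shown when the "Apply Softmax" box is unchecked.
cosine_scores = pd.Series(np.asarray(logits_per_image[0]), index=captions, name="cosine sim")

# Softmax over the caption axis yields values that sum to 1 ("normalized prob").
probs = jax.nn.softmax(logits_per_image, axis=1)
normalized = pd.Series(np.asarray(probs[0]), index=captions, name="normalized prob")

print(cosine_scores)
print(normalized)
```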
intro.md
ADDED
@@ -0,0 +1,32 @@
+# KoCLIP
+
+KoCLIP is a Korean port of OpenAI's CLIP.
+
+## Models
+
+We trained two models, `koclip-base` and `koclip-large`. Both use RoBERTa-large, a fairly large language model, as the text encoder. This decision was motivated by the intuition that annotated Korean datasets are rare; a well-trained language model would be key to producing a performant multimodal pipeline given limited data.
+
+| KoCLIP         | LM                   | ViT                            |
+|----------------|----------------------|--------------------------------|
+| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
+| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |
+
+## Data
+
+KoCLIP was fine-tuned on 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of the image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored on approximately 40,000 images from the validation set of the same dataset.
+
+While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text (WiT) dataset, we found non-trivial discrepancies in how captions were curated in WiT and MSCOCO, and ultimately decided to train on the relatively cleaner MSCOCO captions rather than introduce more noise.
+
+## Demo
+
+We present three demos, each illustrating a different use case of KoCLIP.
+
+* *Image to Text*: Essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided.
+* *Text to Image*: Essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings and retrieves the images that best match the text.
+* *Text to Patch*: Another variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them by their relevance to the text query.
+
+---
+
+We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. We hope you enjoy the demo!
+
+
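To make the "Image to Text" demo described in intro.md concrete, here is a minimal sketch of the zero-shot flow, modeled on image2text.py. The `load_model` import path, the CLIPProcessor-style keyword arguments, and the example image path are assumptions for illustration, not the repo's exact API.

```python
import jax
import jax.numpy as jnp
from PIL import Image

from utils import load_model  # assumed location of the repo's load_model helper

model, processor = load_model("koclip/koclip-base")

image = Image.open("example.jpg")  # any local image
captions = ["귀여운 고양이", "멋있는 강아지", "포동포동한 햄스터"]

# Assumed processor call; image2text.py transposes pixel_values to
# channel-last before the forward pass, so the same is done here.
inputs = processor(text=captions, images=image, return_tensors="jax", padding=True)
inputs["pixel_values"] = jnp.transpose(inputs["pixel_values"], axes=[0, 2, 3, 1])

outputs = model(**inputs)
probs = jax.nn.softmax(outputs.logits_per_image, axis=1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {float(p):.3f}")
```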
intro.py
ADDED
@@ -0,0 +1,6 @@
+import streamlit as st
+
+
+def app(*args):
+    with open("intro.md") as f:
+        st.markdown(f.read())
text2image.py
CHANGED
@@ -17,17 +17,9 @@ def app(model_name):
     st.title("Text to Image Search Engine")
     st.markdown(
         """
-        This
-        5000 images from [MSCOCO](https://cocodataset.org/#home) 2017 validation set was generated using trained KoCLIP
-        vision model. They are ranked based on cosine similarity distance from input Text query embeddings and top 10 images
-        are displayed below.
+        This demo explores KoCLIP's use case as a Korean image search engine. We pre-computed embeddings of 5000 images from the [MSCOCO](https://cocodataset.org/#home) 2017 validation set using KoCLIP's ViT backbone. Then, given a text query from the user, these image embeddings are ranked based on cosine similarity, and the top matches are displayed below.
 
-
-        Korean caption annotations. Korean translation of caption annotations were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence).
-        Base model `koclip` uses `klue/roberta` as text encoder and `openai/clip-vit-base-patch32` as image encoder.
-        Larger model `koclip-large` uses `klue/roberta` as text encoder and bigger `google/vit-large-patch16-224` as image encoder.
-
-        Example Queries : 컴퓨터하는 고양이(Cat playing on a computer), 길 위에서 달리는 자동차(Car running on the road)
+        Example Queries: 컴퓨터하는 고양이 (Cat playing on a computer), 길 위에서 달리는 자동차 (Car on the road)
         """
     )
 
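The retrieval step text2image.py describes (ranking pre-computed image embeddings against a text query embedding by cosine similarity) can be sketched in a few lines. The function name, the 512-dimensional embedding size, and the random stand-in data are illustrative assumptions:

```python
import jax
import jax.numpy as jnp


def rank_images(text_embedding: jnp.ndarray, image_embeddings: jnp.ndarray, top_k: int = 10):
    """Return indices of the top_k images most similar to the text query."""
    # L2-normalize so the dot product equals cosine similarity.
    text = text_embedding / jnp.linalg.norm(text_embedding)
    images = image_embeddings / jnp.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = images @ text  # shape: (num_images,)
    return jnp.argsort(-sims)[:top_k]


# Stand-in data: 5000 pre-computed image embeddings, matching the demo's index size.
image_embeddings = jax.random.normal(jax.random.PRNGKey(0), (5000, 512))
text_embedding = jax.random.normal(jax.random.PRNGKey(1), (512,))
print(rank_images(text_embedding, image_embeddings))
```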
text2patch.py
CHANGED
@@ -25,7 +25,7 @@ def split_image(im, num_rows=3, num_cols=3):
 def app(model_name):
     model, processor = load_model(f"koclip/{model_name}")
 
-    st.title("Patch-based Relevance
+    st.title("Patch-based Relevance Ranking")
     st.markdown(
         """
         Given a piece of text, the CLIP model finds the part of an image that best explains the text.
@@ -37,6 +37,8 @@ def app(model_name):
         which will yield the most relevant image tile from a grid of the image. You can specify how
         granular you want to be with your search by specifying the number of rows and columns that
         make up the image grid.
+
+        ---
         """
     )
 
@@ -46,7 +48,7 @@ def app(model_name):
     )
     query2 = st.file_uploader("or upload an image...", type=["jpg", "jpeg", "png"])
     captions = st.text_input(
-        "Enter
+        "Enter a prompt to query the image.",
        value="이건 서울의 경복궁 사진이다.",
     )
 
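For the Text-to-Patch demo above, the hunk header shows a `split_image(im, num_rows=3, num_cols=3)` helper. A plausible sketch of such a grid split (not the repo's actual implementation) is:

```python
from PIL import Image


def split_image(im: Image.Image, num_rows: int = 3, num_cols: int = 3):
    """Crop an image into num_rows x num_cols tiles, row by row."""
    width, height = im.size
    tile_w, tile_h = width // num_cols, height // num_rows
    tiles = []
    for row in range(num_rows):
        for col in range(num_cols):
            left, upper = col * tile_w, row * tile_h
            tiles.append(im.crop((left, upper, left + tile_w, upper + tile_h)))
    return tiles
```

Each tile would then be embedded with the ViT backbone and scored against the text query's embedding, with the highest-scoring tile surfaced as the most relevant patch.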