how to infer text-img pair demo? #2
by WinstonDeng
When I use the official OpenAI text model, the text embedding dimension is 768, which does not match the 1280-dim LLM2CLIP image embeddings:
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14-336")
text_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
inputs = tokenizer(text=texts, padding=True, return_tensors="pt").to(device)
text_features = text_model.get_text_features(**inputs)  # shape [1, 768]
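Is something like the following the intended text-image inference flow? This is only a rough sketch based on my reading of the LLM2CLIP release: the checkpoint names (microsoft/LLM2CLIP-Openai-L-14-336, microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned), the llm2vec dependency, and the get_text_features/get_image_features calls are my assumptions, so please correct me if the actual API differs.

```python
# Rough sketch (unverified): encode text with the LLM-based encoder + LLM2CLIP adapter
# instead of the original CLIP text tower, so text and image features share one space.
# Checkpoint names and the get_text_features(embeddings) usage are assumptions.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor
from llm2vec import LLM2Vec  # assumed dependency for pooling the LLM text embeddings

device = "cuda"

# LLM2CLIP vision tower + adapter (assumed checkpoint name)
clip_model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device).eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# LLM text encoder fine-tuned for LLM2CLIP (assumed checkpoint name)
llm_name = "microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned"
llm = AutoModel.from_pretrained(llm_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_name)
l2v = LLM2Vec(llm, tokenizer, pooling_mode="mean", max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image = Image.open("CLIP.png")
pixels = processor(images=image, return_tensors="pt").pixel_values.to(device)

with torch.no_grad(), torch.autocast(device, dtype=torch.bfloat16):
    image_features = clip_model.get_image_features(pixels)              # e.g. [1, 1280]
    text_embs = l2v.encode(captions, convert_to_tensor=True).to(device)
    text_features = clip_model.get_text_features(text_embs)             # projected to the image dim

    # cosine similarity between the image and every caption
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)
```

In other words, my understanding is that texts should not go through get_text_features of the original openai/clip-vit-large-patch14-336 checkpoint at all, since its 768-dim output is not aligned with the 1280-dim LLM2CLIP image features. Is that correct?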