ONNX clip-v2 inference: q4/uint8 speed slower than fp16

#37
by eminarcissus - opened

Hi, I'm using a 3090 and jina-ai clip-v2 to extract vectors from a batch of images. With the fp16 model I can run a batch size of around 96 on the 3090 in roughly 2-3 seconds of inference time (including preprocessing), but the uint8 and q4 models can only run a batch size of 48, with uint8 taking around 5-8 seconds and q4 around 2-3 seconds. Is that normal? The uint8 model is even slower than fp16, and q4 takes about the same time to run inference; is that something I should be expecting here? Here's the code I'm using for testing.

import numpy as np
import onnxruntime as ort
from transformers import AutoImageProcessor

session = ort.InferenceSession('/home/user/dev/jina-clip-v2/onnx/model_uint8.onnx', providers=['CUDAExecutionProvider'])
image_processor = AutoImageProcessor.from_pretrained('/home/user/dev/jina-clip-v2', trust_remote_code=True)

length = 48  # batch_size
input_ids = np.random.randint(0, 10, (length, 16))  # dummy token ids, the text branch is not what I'm measuring
pixel_values = image_processor(images)['pixel_values']  # `images` is my batch of PIL images
session.run(None, {'input_ids': input_ids, 'pixel_values': np.array(pixel_values)})
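For reference, this is roughly how I'm timing the three variants against each other. It's a minimal sketch: the model_fp16.onnx and model_q4.onnx filenames and the `images` list are assumptions about my local setup, and I cast pixel_values to whatever dtype each session reports so the feed matches what the graph declares.

import time
import numpy as np
import onnxruntime as ort
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained('/home/user/dev/jina-clip-v2', trust_remote_code=True)
pixel_values = np.array(image_processor(images)['pixel_values'])  # same batch of PIL images as above
input_ids = np.random.randint(0, 10, (pixel_values.shape[0], 16))

# map the dtype string reported by onnxruntime to a numpy dtype
dtype_map = {'tensor(float)': np.float32, 'tensor(float16)': np.float16}

for path in ['/home/user/dev/jina-clip-v2/onnx/model_fp16.onnx',   # assumed filename
             '/home/user/dev/jina-clip-v2/onnx/model_uint8.onnx',
             '/home/user/dev/jina-clip-v2/onnx/model_q4.onnx']:     # assumed filename
    session = ort.InferenceSession(path, providers=['CUDAExecutionProvider'])
    # cast pixel_values to whatever dtype this particular model declares for its input
    pv_type = next(i.type for i in session.get_inputs() if i.name == 'pixel_values')
    feeds = {'input_ids': input_ids,
             'pixel_values': pixel_values.astype(dtype_map.get(pv_type, np.float32))}
    session.run(None, feeds)                      # warm-up run, so session/CUDA init is not counted
    start = time.perf_counter()
    session.run(None, feeds)
    print(path, round(time.perf_counter() - start, 3), 's')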

Also, I found that neither the uint8 nor the q4 model can take a uint8-quantized image directly as input; they both still ask for fp16 input. Is there anything I missed here?
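Concretely, this is how I'm checking what the exported graph declares for its inputs (a small sketch using the same uint8 path as above); for both quantized models it still reports a float type for pixel_values rather than uint8.

import onnxruntime as ort

session = ort.InferenceSession('/home/user/dev/jina-clip-v2/onnx/model_uint8.onnx',
                               providers=['CUDAExecutionProvider'])
# print the declared name, element type and shape of every graph input;
# pixel_values still comes back as a float tensor, not uint8
for inp in session.get_inputs():
    print(inp.name, inp.type, inp.shape)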
