Generation too slow
Hi everyone!
I successfully fine-tuned this model using LoRA on a g5.xlarge instance on AWS, with a batch size of 4 and a maximum sequence length of 2000 tokens.
Now I'm experiencing a severe slowdown during inference. I ran a few experiments and got the following results:
prompt token length = 200, TTFT = 6.20 s
prompt token length = 300, TTFT = 9.20 s
prompt token length = 400, TTFT = 9.30 s
prompt token length = 500, TTFT = 10.20 s
prompt token length = 600, TTFT = 11.20 s
prompt token length = 700, TTFT = 12.70 s
prompt token length = 800, TTFT = 14.10 s
prompt token length = 878, TTFT = 16.00 s
Note: TTFT = time to first token.
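For context, here is a minimal sketch of the kind of TTFT measurement I'm describing (this assumes transformers' TextIteratorStreamer and a placeholder model path; my actual benchmark script may differ in the details):

```python
import time
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# "my-finetuned-model" is a placeholder for the actual checkpoint path
model = AutoModelForCausalLM.from_pretrained("my-finetuned-model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-model")

def measure_ttft(prompt: str, max_new_tokens: int = 128) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    start = time.perf_counter()
    # run generation in a background thread so we can time the first streamed chunk
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens))
    thread.start()
    next(iter(streamer))                  # blocks until the first decoded text arrives
    ttft = time.perf_counter() - start
    thread.join()
    return ttft
```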
I also noticed that for the subsequent generated tokens I got:
batch size = 4
prompt length = 878
output length = 127 tokens
total time: 77.94 s/it => 77.94 - 16 = 61.94 s from the 2nd token onward => 126 / 61.94 ≈ 2.03 tokens per second
So I'm getting roughly 2 tokens per second, which is very slow!
Is that normal for my hardware? Could it be due to the LoRA adapter?
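To make that last question concrete: by "the LoRA adapter" I mean the un-merged adapter still attached at inference time. A minimal sketch of what merging it would look like (assuming the adapter was saved with PEFT; the paths are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# "base-model" and "./lora-adapter" are placeholders for the actual paths
base = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "./lora-adapter")

# merge_and_unload() folds the LoRA deltas into the base weights,
# so generation no longer pays for the extra adapter matmuls
model = model.merge_and_unload()
```

Would merging be expected to make a noticeable difference here?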
Thanks for your help