Generation too slow

#81
by kurama270296

Hi everyone!
I successfully fine-tuned this model with LoRA on an AWS g5.xlarge, using a batch size of 4 and a sequence length of 2000 tokens.
Now I'm experiencing a severe slowdown during inference. I ran a few experiments and got the following results:

prompt token length = 200, TTFT = 6.20 seconds
prompt token length = 300, TTFT = 9.20 seconds
prompt token length = 400, TTFT = 9.30 seconds
prompt token length = 500, TTFT = 10.20 seconds
prompt token length = 600, TTFT = 11.20 seconds
prompt token length = 700, TTFT = 12.70 seconds
prompt token length = 800, TTFT = 14.10 seconds
prompt token length = 878, TTFT = 16.00 seconds
NOTE: TTFT = TIME TO FIRST TOKEN
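For anyone who wants to reproduce this, a minimal TTFT measurement with transformers could look like the sketch below (the model path, prompt, and generation settings are placeholders, not my exact setup):

```python
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "path/to/finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # placeholder: pad/trim to the prompt length under test
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens from a background generate() call so we can timestamp
# the moment the first token arrives.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
start = time.perf_counter()
thread = Thread(
    target=model.generate,
    kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128},
)
thread.start()

next(streamer)  # blocks until the first generated token is available
print(f"TTFT: {time.perf_counter() - start:.2f}s")

for _ in streamer:  # drain the rest of the generation
    pass
print(f"total: {time.perf_counter() - start:.2f}s")
thread.join()
```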
I also noticed that for the subsequent generated tokens I got:

batch size = 4,
prompt length = 878,
output length = 127
total time: 77.94 s/it => (77.94 - 16) = 61.94 seconds from the 2nd token onward => 126 / 61.94 = 2.03 tokens per second!
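Spelled out as a quick sanity check (same numbers as above):

```python
total_time = 77.94  # seconds per iteration at batch size 4
ttft = 16.0         # time to first token, seconds
output_len = 127    # generated tokens

decode_time = total_time - ttft          # 61.94 s for tokens 2..127
tokens_per_sec = (output_len - 1) / decode_time
print(f"{tokens_per_sec:.2f} tokens/s")  # ~2.03
```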

So I'm getting roughly 2 tokens per second, which is very slow!

Is that normal for my hardware? Could it be due to the LoRA adapters?
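For what it's worth, one thing I want to rule out is the overhead of running the LoRA adapters as separate layers at inference time. Here is a minimal sketch of merging them into the base weights with peft (assuming the adapter was trained with peft; the paths are placeholders):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model_id = "path/to/base-model"   # placeholder
adapter_path = "path/to/lora-adapter"  # placeholder

base = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the LoRA deltas into the base weights and drop the adapter
# wrappers, so inference runs at plain base-model speed.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```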
Thanks for your help
