about 310 tokens/s

#6
by lchustc - opened

Hi, I reproduced LyraChatGLM on my A100 and got a speed of 71 tokens/s.
I wonder under what settings you achieved your 310 tokens/s?
I have tested the speed of LyraChatGLM and the raw ChatGLM with batch_size=1 (31 tokens/s and 71 tokens/s).
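For comparing numbers like these, throughput is usually computed as total generated tokens divided by wall-clock time. A minimal sketch (the `fake_generate` stand-in is hypothetical; in practice it would be a real model call returning token ids):

```python
import time

def measure_tokens_per_sec(generate_fn, prompts):
    """Throughput = total generated tokens / wall-clock seconds."""
    start = time.perf_counter()
    outputs = [generate_fn(p) for p in prompts]
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(o) for o in outputs)
    return total_tokens / elapsed

# Hypothetical stand-in for a real model.generate() call:
def fake_generate(prompt):
    return list(range(64))  # pretend we decoded 64 tokens

rate = measure_tokens_per_sec(fake_generate, ["prompt a", "prompt b"])
```

Note that input (prompt) tokens are sometimes counted too, so it's worth stating which convention a reported tokens/s figure uses.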

Tencent Music Entertainment Lyra Lab org
edited May 17, 2023

@lchustc Short answer: try batch_size = 8

To get full speed, we need to increase computational parallelism so that all available resources are fully used. We modified the original ChatGLM batch-preparation method to make it work correctly under the KV-cache optimization, so in batch mode we can run much faster than the original version. (In fact, the original ChatGLM doesn't support batch inference at all; it cannot infer correctly in batches.)
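The batch-preparation fix mentioned above typically comes down to padding: for decoder-only models with a KV cache, prompts in a batch are usually left-padded so every row's last position holds a real token and incremental decoding stays aligned. A toy sketch of that idea (pure Python, `PAD_ID` and the helper are illustrative, not LyraChatGLM's actual code):

```python
PAD_ID = 0  # assumed pad token id for this sketch

def prepare_batch(sequences):
    """Left-pad variable-length token-id lists to a common length
    and build the matching attention mask. Left padding keeps each
    row's final position on a real token, which is what batched
    KV-cache decoding needs to generate the next token correctly."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        pad = max_len - len(s)
        input_ids.append([PAD_ID] * pad + list(s))
        attention_mask.append([0] * pad + [1] * len(s))
    return input_ids, attention_mask

ids, mask = prepare_batch([[11, 12, 13], [21, 22]])
```

Right padding would instead leave pad tokens at the end of shorter rows, and the cached states for those positions would corrupt subsequent decoding steps — which is one common reason naive batched inference produces wrong outputs.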


Hi, I have achieved correct batched (bs=8) inference with the original ChatGLM and got a speed of 137 tokens/s on my A100.
You can see this issue: https://github.com/THUDM/ChatGLM-6B/issues/745

Tencent Music Entertainment Lyra Lab org

@lchustc Good job! We started this project from an older version and didn't notice this update.

I'll update the README to make it clearer.

Hi, @bigmoyan
what's the input length for 310 tokens/s?
Thanks.

@bigmoyan @lchustc I got 70 tokens/s on my A100 with batch_size = 8.
Can you share your demo code with batch_size=8?

Tencent Music Entertainment Lyra Lab org

Everything is updated. Please try the new version.

vanewu changed discussion status to closed
