Some missing pieces for running on a TGI server
I think that to run this model on a TGI server, we need to add "base_model_name_or_path" to "config.json". Also, "model.safetensors" seems incomplete: there is only one shard.
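For anyone hitting the same issue, here is a minimal sketch of the config.json workaround (the base model name below is an assumption; use whichever base model the accelerator actually targets):

```python
# Minimal sketch (not the official fix): add the missing key to the speculator's
# config.json. The base model path is an assumption -- adjust as needed.
import json

CONFIG_PATH = "config.json"                          # path inside the downloaded model repo
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed base model

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["base_model_name_or_path"] = BASE_MODEL

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```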
@yangxia20000 it should be complete. I tested the model.safetensors version using FMS, and it worked OK.
Does the pytorch_model.bin ckpt version work for you? Maybe the TGIS setup does not like the auto-generated model.safetensors.
Hi @sahilsuneja, thanks a lot for your reply! My understanding is that n_predict is 4, so for emb we need 4 weights: speculator.emb.0.weight, speculator.emb.1.weight, speculator.emb.2.weight, speculator.emb.3.weight, and similarly for ln, proj, and head. However, in this model there is only 1 emb, 1 ln, 2 proj, and 1 head. Presumably you have some way to reuse the weights; otherwise the weights are not enough?
Hi @yangxia20000, you are right: the weights are tied. We have tested that the FMS implementation (link shared above) and the vLLM implementation of MLPSpeculator() work with these tied-weight ckpts. Does TGIS give you an error for both the .bin and .safetensors ckpts?
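For illustration only (this is a minimal sketch of the idea, not the actual FMS/fms-extras code): with tied weights, a single emb/ln/head parameter set serves every prediction step, which is why the checkpoint stores only one copy of each, plus a second proj whose input dimension differs for the first step.

```python
import torch.nn as nn

class TiedSpeculatorSketch(nn.Module):
    """Illustrative sketch of weight tying across n_predict prediction steps.

    Only one emb/ln/head module is stored and reused at every step, so the
    checkpoint contains a single copy of those weights. Two proj layers are
    kept because the first step's input dimension differs from later steps.
    """

    def __init__(self, emb_dim, inner_dim, vocab_size, n_predict=4):
        super().__init__()
        self.n_predict = n_predict
        self.emb = nn.Embedding(vocab_size, inner_dim)    # shared across all steps
        self.proj_first = nn.Linear(emb_dim, inner_dim)   # step 0 projection
        self.proj_rest = nn.Linear(inner_dim, inner_dim)  # steps 1..n-1 projection
        self.ln = nn.LayerNorm(inner_dim)                 # shared across all steps
        self.act = nn.GELU()
        self.head = nn.Linear(inner_dim, vocab_size)      # shared across all steps

    def forward(self, hidden, last_token):
        # hidden: (batch, emb_dim) base-model state; last_token: (batch,) token ids
        logits_per_step = []
        state, token = hidden, last_token
        for i in range(self.n_predict):
            proj = self.proj_first if i == 0 else self.proj_rest
            state = self.act(self.ln(proj(state) + self.emb(token)))
            step_logits = self.head(state)
            logits_per_step.append(step_logits)
            token = step_logits.argmax(dim=-1)            # greedy guess feeds the next step
        return logits_per_step
```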
Hi @sahilsuneja, I tried ".safetensors", and it gave errors about not finding the emb weights for the second prediction step. I did not test ".bin", but I expect the same, since it would not make sense to support weight reuse for the ".bin" format but not for the ".safetensors" format.
Ok, thanks @yangxia20000 . Will take a look and get back to you.
@yangxia20000 just to confirm: are you using IBM TGIS, or HF TGI?
I checked-- the corresponding PR to use this with TGI is here: https://github.com/huggingface/text-generation-inference/pull/2215
@sahilsuneja Sorry to trouble you again! Are you still using https://github.com/foundation-model-stack/fms-fsdp/pull/35 to train the speculators? We want to train speculators as needed, but that PR is still experimental. What should we expect to be added for a formal release?
Hi @yangxia20000, PR #35 is outdated. We expect to release a stable code version in about 3 weeks.
If you wish to train custom speculators before then, I can point you to the specific branches of the foundation-model-stack, fms-extras, and fms-fsdp repos in the meanwhile.
@sahilsuneja Thanks a lot for sharing that info! It would be greatly appreciated if you could also share the specific branches!
OK, here are the specific branches. There is no documentation yet, but there are scripts containing examples of the args to pass to train_speculator.py. We are working on merging these into their respective mains and adding documentation.
fms-fsdp: https://github.com/sahilsuneja1/fms-fsdp.git + specu_train_ss branch
foundation-model-stack: https://github.com/sahilsuneja1/foundation-model-stack.git + ss_tp_changes branch
fms-extras: https://github.com/sahilsuneja1/fms-extras.git + paged_gpt_bigcode_ss branch
@yangxia20000 FYI The speculator training implementation is now available in fms-fsdp main via PR #114
Oh, and base_model_name_or_path is also added to config.json, as you pointed out.
Thanks a lot for sharing it!
@sahilsuneja Hi, sorry to trouble you again! Would it be possible for you to share some experimental results/performance numbers for the MLP speculator with the llama3.1 models, especially on vLLM? Thanks a lot!
Hi @yangxia20000 , for llama3.1 we were able to use the speculators trained for llama3. Following are the speedup numbers I could dig up:
- llama3.1-8b-instruct speedup using ibm-fms/llama3-8b-accelerator: tokens/step = 2.22
- llama3.1-70b-instruct speedup using ibm-fms/llama3-70b-accelerator: tokens/step = 2.05, vLLM end-to-end speedup = 1.5x
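For context, a minimal sketch of how such a run can be set up in vLLM; the kwargs below (speculative_model, num_speculative_tokens, use_v2_block_manager) assume a vLLM version from around the 0.5.x line and may differ in other releases, and the base model name is an assumption:

```python
from vllm import LLM, SamplingParams

# Sketch only: engine kwargs are assumptions for a vLLM 0.5.x-era build.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",       # assumed base model
    speculative_model="ibm-fms/llama3-8b-accelerator",   # MLP speculator from this thread
    num_speculative_tokens=4,                            # matches n_predict=4
    use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(out[0].outputs[0].text)
```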
@sahilsuneja Thanks a lot for sharing the results. I ran a simple benchmark on mt_bench. The llama3-70b-accelerator results for llama3-70b are similar, but the results for llama3-8b-accelerator are quite different: I got an average of tokens/step = 1.817167 for llama3.1-8b (for llama3-8b it is 1.930691). Besides, the latency per step is 1.573471x larger (on vLLM, with max-model-len=32000), so the speedup is only 1.1548779736x for batch size = 1. My experiments ran on vLLM, so there is no tree attention. What datasets did you use, and can I reproduce your numbers?
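For clarity, the ~1.15x end-to-end number is just the tokens/step gain divided by the per-step latency inflation; a quick check with the figures above:

```python
# Quick sanity check using the numbers reported above (llama3.1-8b on vLLM).
tokens_per_step = 1.817167     # accepted tokens per forward pass
step_latency_ratio = 1.573471  # per-step latency vs. plain (non-speculative) decoding

end_to_end_speedup = tokens_per_step / step_latency_ratio
print(f"{end_to_end_speedup:.4f}")  # ~1.1549, i.e. the ~1.15x reported
```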
Thanks for testing the accelerators @yangxia20000!
We used samples from the commoncrawl dataset to test the speedup.
I don't have the end-to-end vLLM speedup numbers for llama3-8b. What's the difference between how you calculate the 1.8x and 1.15x speedup numbers?
@sahilsuneja Hi, thanks a lot for your reply! 1.8x is the accept length, i.e., the number of tokens per step (forward pass). 1.15x is the real speedup on vLLM, because the time per step also increases. My point is that the accept length should be the same across backends (given the same payload, output length, temperature, etc.). I will try to reproduce the same accept_length with commoncrawl. Just to confirm: the tokens/step you report is the average over the dataset, and it is not based on tree attention, right? Is the temperature set to 0.0? Thanks again for your help!
Hi @yangxia20000, thanks for the clarification.
Yes, tokens/step is the average value over 256 randomly selected commoncrawl prompts.
The 2.22x speedup experiment used num_candidates=5 with [6,4,3,3] as the number of top-k predictions from each head, if that helps.
I will also ask someone from the team to look into the practical/end-to-end 1.15x vllm speedup you reported.
OK, I ran it again with the vLLM setup, i.e., num_candidates=1 and [1,1,1,1] top_k_tokens_per_head, and I get ~1.8 tokens/step, as you observed. #consistent
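To make the two settings concrete: the tree-attention run can expand up to prod([6,4,3,3]) = 216 four-token continuations (of which num_candidates are kept; the exact pruning rule is glossed over here), while the vLLM-style run keeps a single path. A tiny illustrative sketch:

```python
import math

# Candidate-tree sizes for the two settings discussed above (illustrative only;
# the actual candidate selection inside FMS/vLLM may differ).
settings = {
    "tree attention (2.22 tokens/step run)": {"top_k_per_head": [6, 4, 3, 3], "num_candidates": 5},
    "vLLM-style top-1":                      {"top_k_per_head": [1, 1, 1, 1], "num_candidates": 1},
}

for name, cfg in settings.items():
    max_paths = math.prod(cfg["top_k_per_head"])   # all possible 4-token continuations
    kept = min(cfg["num_candidates"], max_paths)   # candidates actually carried forward
    print(f"{name}: up to {max_paths} paths, {kept} candidate(s) kept")
```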
@sahilsuneja Thanks a lot for sharing the details. Then it makes sense: since vLLM only supports top-1 prediction, the accept_length should be smaller. Thanks a lot!
@sahilsuneja thanks!
@sahilsuneja One implication: without tree attention, can we still expect to reuse speculators trained for llama3 with llama3.1? The difference in accept_length is small with tree attention, but that might be because tree attention speculates many continuations. In my mt_bench benchmark, the accept_length for llama3-8b is 1.930691, while it is 1.817167 for llama3.1-8b. Is this gap expected? Thanks!
Hi @yangxia20000, yes, that diff in tokens/step between llama3-8b and llama3.1-8b is expected.
Regarding tree attention contributing to the speculator reuse, I think you have a good point. Tagging @daviswer for comment.