Some missing stuff to run on TGI server

#2
by yangxia20000 - opened

I think to run this model on a TGI server, we need to add "base_model_name_or_path" to "config.json". Also, "model.safetensors" seems incomplete: there is only one shard.
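For reference, here is roughly how I patch the config locally. This is just a sketch; the base-model name below is a placeholder for whatever base model this speculator targets.

```python
# Sketch: add the missing "base_model_name_or_path" field to config.json.
# The model name is a placeholder -- use the base model this speculator serves.
import json

with open("config.json") as f:
    cfg = json.load(f)

cfg.setdefault("base_model_name_or_path", "meta-llama/Meta-Llama-3-8B-Instruct")

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```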

ibm-ai-platform org
•
edited Aug 2

@yangxia20000 it should be complete. I tested the model.safetensors version using FMS, and it worked ok.
Does the pytorch_model.bin ckpt version work for you? Maybe the TGIS setup does not like the auto-generated model.safetensors.

Hi @sahilsuneja, thanks a lot for your reply! My understanding is that n_predict is 4; therefore, for emb we need 4 weights: speculator.emb.0.weight, speculator.emb.1.weight, speculator.emb.2.weight, speculator.emb.3.weight, and similarly for ln, proj, and head. However, in this model there is only 1 emb, 1 ln, 2 proj, and 1 head. Presumably you have some way to reuse the weights; otherwise there are not enough weights?

ibm-ai-platform org

Hi @yangxia20000, you are right-- the weights are tied. We have tested the FMS implementation (link shared above), and the vLLM implementation of MLPSpeculator() works with these tied-weight ckpts. Does TGIS give you an error for both the .bin and .safetensors ckpts?
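In case it helps to see the idea, here is a rough illustrative sketch of the tying (this is not the actual MLPSpeculator code; module names and dimensions are made up). A single emb/ln/head is reused at every prediction step, so the checkpoint only needs one copy of each tied weight:

```python
# Illustrative sketch only -- not the FMS / vLLM MLPSpeculator implementation.
# One emb, one ln, two proj, and one head serve all n_predict steps.
import torch
import torch.nn as nn

class TiedSpeculatorSketch(nn.Module):
    def __init__(self, vocab_size=32000, emb_dim=4096, inner_dim=3072, n_predict=4):
        super().__init__()
        self.n_predict = n_predict
        self.emb = nn.Embedding(vocab_size, inner_dim)                 # shared by all steps
        self.proj_state = nn.Linear(emb_dim, inner_dim, bias=False)    # used at step 0
        self.proj_inner = nn.Linear(inner_dim, inner_dim, bias=False)  # reused at steps 1..n-1
        self.ln = nn.LayerNorm(inner_dim)                              # shared by all steps
        self.head = nn.Linear(inner_dim, vocab_size, bias=False)       # shared by all steps

    def forward(self, hidden, last_token):
        # hidden: [batch, emb_dim] base-model state; last_token: [batch] token ids
        logits_per_step = []
        state = self.proj_state(hidden)
        for _ in range(self.n_predict):
            state = self.ln(state + self.emb(last_token))
            step_logits = self.head(state)               # same head every step
            logits_per_step.append(step_logits)
            last_token = step_logits.argmax(dim=-1)      # greedy, for illustration only
            state = self.proj_inner(state)               # same inner proj every step
        return torch.stack(logits_per_step, dim=1)
```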

Hi @sahilsuneja, I tried ".safetensors", and it gave errors about not being able to find the emb weights for the second prediction step. I did not test ".bin", but I expect the same result, since it would not make sense for weight reuse to be supported for the .bin format but not for the .safetensors format.
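For what it's worth, this is how I list which parameter names the checkpoint actually contains (a quick sketch; it assumes the safetensors package is installed and the file is in the current directory):

```python
# List the parameter names and shapes stored in the speculator checkpoint.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(name, f.get_slice(name).get_shape())
```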

ibm-ai-platform org

Ok, thanks @yangxia20000 . Will take a look and get back to you.

ibm-ai-platform org
•
edited Aug 6

@yangxia20000 just to confirm-- are you using IBM TGIS, or HF TGI?
I checked-- the corresponding PR to use this with TGI is here: https://github.com/huggingface/text-generation-inference/pull/2215

@sahilsuneja I used HF TGI. Thanks a lot for your great help!

@sahilsuneja Sorry to trouble you again! Are you still using https://github.com/foundation-model-stack/fms-fsdp/pull/35 to train the speculators? We want to train speculators as needed, but this PR is still experimental. What should we expect to be added for a formal release?

ibm-ai-platform org

Hi @yangxia20000
PR35 is outdated. We expect to release a stable code version in about 3 weeks.
If you wish to train custom speculators before then, I can point you to the specific branches of the foundation-model-stack, fms-extras, and fms-fsdp repos in the meantime.

@sahilsuneja Thanks a lot for sharing that info! It would be greatly appreciated if you could also share the specific branches!

ibm-ai-platform org

Ok, here are the specific branches. There is no documentation yet, but there are scripts containing examples of the args that need to be passed to train_speculator.py. We are working on merging these into their respective mains and adding documentation.
fms-fsdp: https://github.com/sahilsuneja1/fms-fsdp.git + specu_train_ss branch
foundation-model-stack: https://github.com/sahilsuneja1/foundation-model-stack.git + ss_tp_changes branch
fms-extras: https://github.com/sahilsuneja1/fms-extras.git + paged_gpt_bigcode_ss branch

@sahilsuneja Thanks a lot for sharing all of this work!

ibm-ai-platform org

@yangxia20000 FYI, the speculator training implementation is now available in fms-fsdp main via PR #114.

ibm-ai-platform org

Oh, and base_model_name_or_path has also been added to config.json, as you pointed out.

Thanks a lot for sharing it!

@sahilsuneja Hi, sorry to trouble you again! Would it be possible for you to share some experimental results/performance numbers for the MLP speculator with llama3.1 models, especially on vLLM? Thanks a lot!

ibm-ai-platform org
•
edited Sep 17

Hi @yangxia20000 , for llama3.1 we were able to use the speculators trained for llama3. Following are the speedup numbers I could dig up:

  • llama3.1-8b-instruct speedup using ibm-fms/llama3-8b-accelerator: tokens/step = 2.22
  • llama3.1-70b-instruct speedup using ibm-fms/llama3-70b-accelerator: tokens/step = 2.05, vLLM end-to-end speedup = 1.5x

@sahilsuneja Thanks a lot for sharing the results. I did a simple benchmark on mt_bench. The llama3-70b-accelerator results for llama3-70b are similar, but the results for llama3-8b-accelerator are quite different. I got an average of tokens/step = 1.817167 (this is for llama3.1-8b; for llama3-8b it is 1.930691). Also, the latency per step is 1.573471x larger (on vLLM, with max-model-len=32000). Thus, the speedup is only 1.1548779736x for batch size = 1. My experiments were run on vLLM, so there is no tree-attention. What datasets did you use, and can I reproduce your numbers?
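To spell out how I get the 1.15x figure from those measurements (illustrative Python using the numbers above):

```python
# End-to-end speedup = accepted tokens per step / per-step latency inflation.
tokens_per_step = 1.817167       # llama3.1-8b with llama3-8b-accelerator on mt_bench
step_latency_ratio = 1.573471    # speculative step time / baseline step time

end_to_end_speedup = tokens_per_step / step_latency_ratio
print(f"{end_to_end_speedup:.4f}x")  # ~1.1549x
```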

ibm-ai-platform org

Thanks for testing the accelerators @yangxia20000 !
We used samples from the commoncrawl dataset to test the speedup.
I don't have the end-to-end vLLM speedup numbers for llama3-8b. What's the difference between how you calculate the 1.8x and 1.15x speedup numbers?

@sahilsuneja Hi, thanks a lot for your reply! 1.8x is the accept length, i.e., the number of tokens per step or forward pass. 1.15x is the actual speedup on vLLM, because the time per step also increases. My question comes from the fact that the accept length should be the same across backends (with the same payload, output length, temperature, etc.). I will try to reproduce the same accept_length with commoncrawl. Just to confirm: the tokens/step you report is the average over the dataset, and it is not based on tree-attention, right? Is the temperature set to 0.0? Thanks again for your help!

ibm-ai-platform org

Hi @yangxia20000 , thanks for the clarification.
Yes, tokens/step is the average value over 256 randomly selected commoncrawl prompts.
The 2.22 tokens/step experiment used num_candidates=5 with [6,4,3,3] as the number of top-k predictions from each head, if that helps.
I will also ask someone from the team to look into the practical/end-to-end 1.15x vLLM speedup you reported.

ibm-ai-platform org

Ok, I ran it again with the vLLM setup, i.e., num_candidates=1 and [1,1,1,1] top_k_tokens_per_head, and I get ~1.8 tokens/step as you observed. #consistent
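As a rough intuition for why the two setups differ, here is a sketch that only counts the speculation budget in each configuration; the actual candidate selection happens inside fms-extras / vLLM, so treat the tree-path count as an upper bound:

```python
# Compare the speculation budget of the two configurations discussed above.
import math

def speculation_budget(num_candidates, top_k_per_head):
    # Upper bound on distinct multi-token continuations the tree can cover,
    # plus the number of flattened candidates actually verified per step.
    return math.prod(top_k_per_head), num_candidates

print(speculation_budget(5, [6, 4, 3, 3]))  # (216, 5) -> 2.22 tokens/step reported
print(speculation_budget(1, [1, 1, 1, 1]))  # (1, 1)   -> ~1.8 tokens/step
```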

@sahilsuneja Thanks a lot for sharing the details. That makes sense then, because vLLM only supports top-1 prediction, so the accept_length should be smaller. Thanks a lot!

@sahilsuneja thanks!

@sahilsuneja One implication is that, without tree-attention, can we still expect to use speculators trained for llama3 with llama3.1? The difference in accept_length is small with tree-attention, but this might be because tree-attention speculates many continuations. In my mt_bench benchmark, the accept_length for llama3-8b is 1.930691, while it is 1.817167 for llama3.1-8b. Is this gap expected? Thanks!

ibm-ai-platform org
•
edited Sep 20

Hi @yangxia20000, yes, that difference in tokens/step between llama3-8b and llama3.1-8b is expected.
Regarding tree attention contributing to speculator reuse, I think you have a good point. Tagging @daviswer for comment.
