Some missing pieces for running on a TGI server
I think that to run this model on a TGI server, we need to add "base_model_name_or_path" to "config.json". Also, "model.safetensors" seems incomplete: there is only one shard.
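For anyone hitting the same issue, here is a minimal sketch of the config.json workaround (the base model name below is an assumption; use whichever base model the accelerator actually targets):

```python
# Minimal sketch (not the official fix): add the missing key to the speculator's
# config.json. The base model path is an assumption -- adjust as needed.
import json

CONFIG_PATH = "config.json"                          # path inside the downloaded model repo
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed base model

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["base_model_name_or_path"] = BASE_MODEL

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```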
@yangxia20000 it should be complete. I tested the model.safetensors version using FMS, and it worked OK.
Does the pytorch_model.bin ckpt version work for you? Maybe the TGIS setup does not like the auto-generated model.safetensors.
Hi @sahilsuneja, thanks a lot for your reply! My understanding is that n_predict is 4, so for emb we need 4 weights: speculator.emb.0.weight, speculator.emb.1.weight, speculator.emb.2.weight, speculator.emb.3.weight, and similarly for ln, proj, and head. However, in this model there is only 1 emb, 1 ln, 2 proj, and 1 head. Presumably you have some way to reuse the weights; otherwise the weights are not enough?
Hi @yangxia20000, you are right: the weights are tied. We have tested that the FMS implementation (link shared above) and the vLLM implementation of MLPSpeculator() work with these tied-weight ckpts. Does TGIS give you an error for both the .bin and .safetensors ckpts?
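For illustration only (this is a minimal sketch of the idea, not the actual FMS/fms-extras code): with tied weights, a single emb/ln/head parameter set serves every prediction step, which is why the checkpoint stores only one copy of each, plus a second proj whose input dimension differs for the first step.

```python
import torch.nn as nn

class TiedSpeculatorSketch(nn.Module):
    """Illustrative sketch of weight tying across n_predict prediction steps.

    Only one emb/ln/head module is stored and reused at every step, so the
    checkpoint contains a single copy of those weights. Two proj layers are
    kept because the first step's input dimension differs from later steps.
    """

    def __init__(self, emb_dim, inner_dim, vocab_size, n_predict=4):
        super().__init__()
        self.n_predict = n_predict
        self.emb = nn.Embedding(vocab_size, inner_dim)    # shared across all steps
        self.proj_first = nn.Linear(emb_dim, inner_dim)   # step 0 projection
        self.proj_rest = nn.Linear(inner_dim, inner_dim)  # steps 1..n-1 projection
        self.ln = nn.LayerNorm(inner_dim)                 # shared across all steps
        self.act = nn.GELU()
        self.head = nn.Linear(inner_dim, vocab_size)      # shared across all steps

    def forward(self, hidden, last_token):
        # hidden: (batch, emb_dim) base-model state; last_token: (batch,) token ids
        logits_per_step = []
        state, token = hidden, last_token
        for i in range(self.n_predict):
            proj = self.proj_first if i == 0 else self.proj_rest
            state = self.act(self.ln(proj(state) + self.emb(token)))
            step_logits = self.head(state)
            logits_per_step.append(step_logits)
            token = step_logits.argmax(dim=-1)            # greedy guess feeds the next step
        return logits_per_step
```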
Hi @sahilsuneja, I tried ".safetensors", and it gave errors about not finding the emb weights for the second prediction step. I did not test ".bin", but I expect the same, since it would not make sense to support weight reuse for the ".bin" format but not for the ".safetensors" format.
Ok, thanks @yangxia20000 . Will take a look and get back to you.
@yangxia20000 just to confirm: are you using IBM TGIS, or HF TGI?
I checked-- the corresponding PR to use this with TGI is here: https://github.com/huggingface/text-generation-inference/pull/2215
@sahilsuneja Sorry to trouble you again! Are you still using https://github.com/foundation-model-stack/fms-fsdp/pull/35 to train the speculators? We want to train speculators as needed, but that PR is still experimental. What should we expect to be added for a formal release?
Hi @yangxia20000, PR #35 is outdated. We expect to release a stable code version in about 3 weeks.
If you wish to train custom speculators before then, I can point you to the specific branches of the foundation-model-stack, fms-extras, and fms-fsdp repos in the meanwhile.
@sahilsuneja Thanks a lot for sharing that info! It would be greatly appreciated if you could also share the specific branches!
OK, here are the specific branches. There is no documentation yet, but there are scripts containing examples of the args to pass to train_speculator.py. We are working on merging these into their respective mains and adding documentation.
fms-fsdp: https://github.com/sahilsuneja1/fms-fsdp.git + specu_train_ss branch
foundation-model-stack: https://github.com/sahilsuneja1/foundation-model-stack.git + ss_tp_changes branch
fms-extras: https://github.com/sahilsuneja1/fms-extras.git + paged_gpt_bigcode_ss branch
@yangxia20000 FYI The speculator training implementation is now available in fms-fsdp main via PR #114
Oh, and base_model_name_or_path is also added to config.json, as you pointed out.
Thanks a lot for sharing it!
@sahilsuneja Hi, sorry to trouble you again! Would it be possible for you to share some experimental results/performance numbers for the MLP speculator with the llama3.1 models, especially on vLLM? Thanks a lot!
Hi @yangxia20000 , for llama3.1 we were able to use the speculators trained for llama3. Following are the speedup numbers I could dig up:
- llama3.1-8b-instruct speedup using ibm-fms/llama3-8b-accelerator: tokens/step = 2.22
- llama3.1-70b-instruct speedup using ibm-fms/llama3-70b-accelerator: tokens/step = 2.05, vLLM end-to-end speedup = 1.5x
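For context, a minimal sketch of how such a run can be set up in vLLM; the kwargs below (speculative_model, num_speculative_tokens, use_v2_block_manager) assume a vLLM version from around the 0.5.x line and may differ in other releases, and the base model name is an assumption:

```python
from vllm import LLM, SamplingParams

# Sketch only: engine kwargs are assumptions for a vLLM 0.5.x-era build.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",       # assumed base model
    speculative_model="ibm-fms/llama3-8b-accelerator",   # MLP speculator from this thread
    num_speculative_tokens=4,                            # matches n_predict=4
    use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(out[0].outputs[0].text)
```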
@sahilsuneja Thanks a lot for sharing the results. I ran a simple benchmark on mt_bench. The llama3-70b-accelerator results for llama3-70b are similar, but the results for llama3-8b-accelerator are quite different: I got an average of tokens/step = 1.817167 for llama3.1-8b (for llama3-8b it is 1.930691). Besides, the latency per step is 1.573471x larger (on vLLM, with max-model-len=32000), so the speedup is only 1.1548779736x for batch size = 1. My experiments ran on vLLM, so there is no tree attention. What datasets did you use, and can I reproduce your numbers?
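For clarity, the ~1.15x end-to-end number is just the tokens/step gain divided by the per-step latency inflation; a quick check with the figures above:

```python
# Quick sanity check using the numbers reported above (llama3.1-8b on vLLM).
tokens_per_step = 1.817167     # accepted tokens per forward pass
step_latency_ratio = 1.573471  # per-step latency vs. plain (non-speculative) decoding

end_to_end_speedup = tokens_per_step / step_latency_ratio
print(f"{end_to_end_speedup:.4f}")  # ~1.1549, i.e. the ~1.15x reported
```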
Thanks for testing the accelerators @yangxia20000!
We used samples from the commoncrawl dataset to test the speedup.
I don't have the end-to-end vLLM speedup numbers for llama3-8b. What's the difference between how you calculate the 1.8x and 1.15x speedup numbers?
@sahilsuneja Hi, thanks a lot for your reply! 1.8x is the accept length, i.e., the number of tokens per step (forward pass). 1.15x is the real speedup on vLLM, because the time per step also increases. My point is that the accept length should be the same across backends (given the same payload, output length, temperature, etc.). I will try to reproduce the same accept_length with commoncrawl. Just to confirm: the tokens/step you report is the average over the dataset, and it is not based on tree attention, right? Is the temperature set to 0.0? Thanks again for your help!
Hi @yangxia20000, thanks for the clarification.
Yes, tokens/step is the average value over 256 randomly selected commoncrawl prompts.
The 2.22x speedup experiment used num_candidates=5 with [6,4,3,3] as the number of top-k predictions from each head, if that helps.
I will also ask someone from the team to look into the practical/end-to-end 1.15x vllm speedup you reported.
OK, I ran it again with the vLLM setup, i.e., num_candidates=1 and [1,1,1,1] top_k_tokens_per_head, and I get ~1.8 tokens/step, as you observed. #consistent
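To make the two settings concrete: the tree-attention run can expand up to prod([6,4,3,3]) = 216 four-token continuations (of which num_candidates are kept; the exact pruning rule is glossed over here), while the vLLM-style run keeps a single path. A tiny illustrative sketch:

```python
import math

# Candidate-tree sizes for the two settings discussed above (illustrative only;
# the actual candidate selection inside FMS/vLLM may differ).
settings = {
    "tree attention (2.22 tokens/step run)": {"top_k_per_head": [6, 4, 3, 3], "num_candidates": 5},
    "vLLM-style top-1":                      {"top_k_per_head": [1, 1, 1, 1], "num_candidates": 1},
}

for name, cfg in settings.items():
    max_paths = math.prod(cfg["top_k_per_head"])   # all possible 4-token continuations
    kept = min(cfg["num_candidates"], max_paths)   # candidates actually carried forward
    print(f"{name}: up to {max_paths} paths, {kept} candidate(s) kept")
```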
@sahilsuneja Thanks a lot for sharing the details. Then it makes sense: since vLLM only supports top-1 prediction, the accept_length should be smaller. Thanks a lot!
@sahilsuneja thanks!
@sahilsuneja One implication: without tree attention, can we still expect to reuse speculators trained for llama3 with llama3.1? The difference in accept_length is small with tree attention, but that might be because tree attention speculates many continuations. In my mt_bench benchmark, the accept_length for llama3-8b is 1.930691, while it is 1.817167 for llama3.1-8b. Is this gap expected? Thanks!
Hi @yangxia20000, yes, that diff in tokens/step between llama3-8b and llama3.1-8b is expected.
Regarding tree attention contributing to the speculator reuse, I think you have a good point. Tagging @daviswer for comment.