Inference on BLOOM 176B is too slow

#59
by mayank-mishra - opened

I have a dedicated server (in complete isolation) running 8 A100 80GBs.
The inference time using the HF checkpoints is:

Input: The president of US is
top_k: 5, top_p: 0.9, temperature: 0.7, min_length: 1, max_length: 20
Output: The president of US is a man of his word. He has promised to make America great again.

This took 90 seconds. Is this normal?

All GPUs are being used:

[Screenshot of nvidia-smi showing all 8 GPUs in use]

Also getting this warning every time:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Not sure if it is related.
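
(For reference, the warning itself can be silenced by setting the environment variable before the tokenizer is used; a minimal sketch, assuming the tokenizer is loaded with AutoTokenizer:)

# Disable tokenizers parallelism explicitly so the fork warning goes away.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # must be set before any tokenization happens

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")  # checkpoint name assumed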

Should I use Megatron-DeepSpeed for inference?
This is quite slow.
There are instances where generation can take up to 500 seconds.

This is how I am using the model:
self.model = AutoModelForCausalLM.from_pretrained(
    args.model_name,
    device_map="auto",
    torch_dtype="auto"
)

output = self.model.generate(
    input_ids=torch.tensor(x["input_ids"]),
    attention_mask=torch.tensor(x["attention_mask"]),
    top_k=top_k,
    top_p=top_p,
    temperature=temperature,
    min_length=min_length,
    max_length=max_length
)
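
(For context, a minimal sketch of the surrounding usage; the tokenization step, device placement, and do_sample flag below are assumptions, not part of the original snippet:)

# Sketch of the surrounding usage; tokenizer attribute, device placement and
# do_sample are assumptions, not part of the original snippet.
inputs = self.tokenizer(text, return_tensors="pt")        # input_ids + attention_mask as tensors
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}   # first device used by device_map="auto"

output = self.model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,        # top_k / top_p / temperature only take effect when sampling
    top_k=top_k,
    top_p=top_p,
    temperature=temperature,
    min_length=min_length,
    max_length=max_length
)
print(self.tokenizer.decode(output[0], skip_special_tokens=True))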

Hi, same issue here with 8 A6000 48GB.

Takes up to 600 seconds for generating 700 tokens.

I am wondering which speed-up methods are most promising. Has anyone already gathered experience with this, or can you provide any other suggestions?

Interesting, I was seeing times not much different from that using pure CPUs, but I figured it was only so slow because it was on CPUs. 8x 48GB is only 384GB of GPU RAM though, so you may be offloading some of the model to CPU RAM during inference. The model consumes around 650GB of system RAM when running inference on pure CPUs. Still, I would have expected your setup to be much faster than what I was seeing with no GPUs. I would like to know the setup that HF is using to host the API; it seems decently performant for inference.

Hi, I used Huggingface Accelerate and checked the device map. Everything is indeed placed on the GPUs.
With the following device map [screenshot of the device map]
the output is [screenshot of the output]
This is looking good so far, I would say. I follow the same code as mayank for the forward run. Maybe something is offloaded to CPU that is not directly visible from the device map (for example, some activations placed in CPU memory).
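
(A minimal sketch of how the placement can be checked programmatically, assuming the model was loaded with device_map="auto"; hf_device_map is the attribute transformers/accelerate expose for this:)

# Sketch: inspect where each module was placed after loading with device_map="auto".
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",    # checkpoint name assumed
    device_map="auto"
)

# hf_device_map maps module names to GPU indices, "cpu" or "disk".
for name, device in model.hf_device_map.items():
    print(name, "->", device)

offloaded = [name for name, device in model.hf_device_map.items() if device in ("cpu", "disk")]
print("offloaded to CPU/disk:", offloaded or "none")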

BigScience Workshop org

Hi all!
I highly suspect that the model is being offloaded to the CPU. I also have access to an 8x A100 80GB machine and it works fine there.
Can you try with torch_dtype=torch.bfloat16 instead and let us know? Also, could you try with accelerate==0.10.0?
Thanks
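
(A minimal sketch of the suggested change, applied to the from_pretrained call from the original post:)

# Sketch: force bfloat16 weights instead of relying on "auto" dtype detection.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",            # checkpoint name assumed; the original code uses args.model_name
    device_map="auto",             # let accelerate shard the model across the GPUs
    torch_dtype=torch.bfloat16     # explicit bf16, as suggested above
)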

On the 8x 48GB setup, the offloading is probably happening to CPU, so that is expected to be slow. Anyways, I changed some things in my code and 8x A100 80GB is working fine now.
Look at the https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/bloom-inference branch in the parent repo.

Could I know what you changed to make it work fine? Did you apply Megatron-DeepSpeed, or just do some configuration to make sure model parameters and activations are not offloaded to CPU?

Hi @pohunghuang , the max memory method in the following file did the trick:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd26b9c4650c74e4159c7aed60a282176f87ac7f/scripts/inference/bloom-accelerate-inference.py

I would recommend writing your code by editing this file.
It's considerably faster.
Also, make sure you have 8x A100 80GBs.
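
(For readers who cannot run that script directly, the core of the max_memory trick is to cap how much of each GPU the dispatcher may fill and pass that dict to from_pretrained. A rough sketch; the margin and helper name are illustrative assumptions, not the exact values from the linked script:)

# Sketch of the max_memory idea: leave headroom on every GPU so weights are not
# offloaded to CPU/disk and there is still room for activations and the KV cache.
import torch
from transformers import AutoModelForCausalLM

def get_max_memory_per_gpu(margin_gib=10):
    n_gpus = torch.cuda.device_count()
    total_gib = torch.cuda.get_device_properties(0).total_memory // (1024 ** 3)
    return {i: f"{total_gib - margin_gib}GiB" for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",                    # checkpoint name assumed
    device_map="auto",
    torch_dtype=torch.bfloat16,
    max_memory=get_max_memory_per_gpu()    # per-GPU caps instead of the defaults
)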

Thanks @mayank31398 , it's a really good and simple solution. Unfortunately, we only have 2 x (8 x A6000 48GB), so multi-node distribution is the only way we can go.

BigScience Workshop org

Hi @pohunghuang and all!
We recently released a beta bitsandbytes integration with HuggingFace that works well for large language models: https://twitter.com/Tim_Dettmers/status/1557343499225219072 I think it would give decent inference speed (at least as fast as the native model) and would work well on A6000s, with roughly a 2x reduction in memory footprint. You would just need to install the latest versions of accelerate, transformers and bitsandbytes (see the precise instructions in the Google Doc shared in the tweet).
The release is still in beta so I would love to hear from you if this is working on your hardware or not!
Thank you :-) !
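
(For reference, the integration is exposed through a single flag on from_pretrained; a minimal sketch, assuming the versions mentioned above are installed:)

# Sketch: load BLOOM with the LLM.int8() bitsandbytes integration.
# Requires recent transformers/accelerate plus bitsandbytes; checkpoint name assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True     # int8 weights, roughly halving the memory footprint
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")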

@ybelkada I want to try bitsandbytes with BLOOM on an 8x A6000 server (CUDA version 11.3). Unfortunately, the following error is thrown:

RuntimeError: Creating a Parameter from an instance of type Int8Params requires that detach() returns an instance of the same type, but return type Tensor. was found instead. To use the type as a Parameter, please correct the detach() semantics defined by __torch_dispatch__() implementation.

I use the code from (https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4), using the model.generate() way from HuggingFace. Do you know how to solve the issue? I installed bitsandbytes==0.31.8 from https://pypi.org/project/bitsandbytes/, the latest transformers package from the master branch, and the latest Accelerate from pip.

BigScience Workshop org

Hi @pai4451 !
Thanks a lot for the feedback!
I replied to you in the thread at https://github.com/huggingface/transformers/pull/17901#issuecomment-1213855206 and I think that fix should work.
Thanks!

I tested BLOOM with 16x A100 40 GB. That's 640 GB of GPU RAM which should very comfortably fit the whole model. However, with a naive setup, generating 30 tokens took nearly 4 minutes.

In this case, I'd expect max_memory to make no difference since there's so much free VRAM that any reasonable defaults should be OK. However, using the max_memory calculator linked above took the generation time for the same example down to 7 seconds. (I don't know why, though. Without it, about 20231 MiB are used per device; with it, 24439 MiB, but the last two devices are not filled.)

This still feels a little slow on a 16 GPU machine but at least it's measured in seconds rather than minutes.

I want to purchase a new system/PC for high-end large language model inference. What specification should I go for?
