Inference takes 8 min with 8 Nvidia V100 GPU

#12
by zkdtckk - opened

It takes 7-8 minutes to run a single inference. Why is it so slow?
Hardware:
AWS SageMaker ml.p3dn.24xlarge
8x Nvidia Tesla V100 GPUs, 32 GB each

Environment:
Torch: 2.0.1+cu117
Accelerate: 0.19.0
transformers: 4.29.1

I use the following lines to load the model and run inference:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline

base_model = 'tiiuae/falcon-40b-instruct'  # Does not work with load_8bit; inference takes 8 min...
load_8bit = False

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model_llm = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=load_8bit,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir="/home/ec2-user/SageMaker/model_cache/",
    trust_remote_code=True,
)

model_llm.eval()
pipe = InstructionTextGenerationPipeline(model=model_llm, tokenizer=tokenizer)
pipe('your prompt')

Technology Innovation Institute org • edited Jun 1, 2023

p3dn is V100, which does not natively support bfloat16. I'm not sure exactly what the fallback is in this case or why it does not throw any errors; maybe it's running in fp32 behind the scenes?

At this time I would recommend switching to ml.g5.24xlarge or ml.g5.48xlarge while we look into how to best support older hardware.
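
For anyone stuck on V100 hardware, here is a minimal sketch of a dtype fallback. It assumes float16 is an acceptable substitute for bfloat16 with this model, which is not something this thread confirms:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = 'tiiuae/falcon-40b-instruct'

# bfloat16 needs Ampere or newer (A10, A100); Volta (V100) lacks native support,
# so fall back to float16 there. Whether fp16 is numerically safe for this model
# is an assumption, not something verified in this thread.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model_llm = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
)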

Thank you for the correction and advice; I never realized that the V100 does not support bfloat16. The A10 delivers roughly 125 TFLOPS of bfloat16 tensor throughput, while the V100 manages only about 15.7 TFLOPS in fp32, which explains the huge difference in performance.

Inference time dropped from 8 minutes to 1 minute on ml.g5.12xlarge with 4 A10 GPUs. It should be even faster on a 48xlarge or an A100 instance. Thanks again for your help.
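
For reference, a rough way to measure the per-call latency, reusing the pipe object from the snippet above (exact numbers will vary with prompt length and generation settings):

import time

prompt = 'your prompt'

# Warm-up call so one-time CUDA/setup overhead does not skew the measurement.
pipe(prompt)

start = time.perf_counter()
result = pipe(prompt)
print(f"Generation took {time.perf_counter() - start:.1f} s")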

zkdtckk changed discussion status to closed
zkdtckk changed discussion title from Inference takes 8 min with 8 Nvidia A100 GPU to Inference takes 8 min with 8 Nvidia V100 GPU
