
example code returns RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

#2
by iekang - opened

Thanks for sharing this amazing model!
When I try to run your example code on my server (8 GPUs, CUDA 11.4), I get the error below. Any insight?

Traceback (most recent call last):
File "/mnt/task_runtime/test_olm_llama.py", line 20, in
generation_output = model.generate(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1522, in generate
return self.greedy_search(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
outputs = self(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in call_impl
return forward_call(*args, **kwargs)
File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu
" not implemented for 'Half'

OpenLM Research org

This is likely a result of running the model on CPU, where half-precision ops are not supported. To use it on CPU, convert its data type to float32 before running any inference.
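
For reference, here is a minimal sketch of CPU-only inference in float32. The model id below is only a placeholder, not necessarily the checkpoint from the report above; substitute the one you are actually testing.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the model you are testing.
model_id = "openlm-research/open_llama_7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # float32 so the CPU matmul kernels exist
).to("cpu")

inputs = tokenizer("Q: What is the largest animal?\nA:", return_tensors="pt")
generation_output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))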

It would be helpful if you pasted the sample code you were testing.

If you see this line, comment it out:
torch.set_default_tensor_type(torch.cuda.HalfTensor)

Where do we find that line? Which file?

If you downloaded the model directly from Meta, there should be this Python script: /llama/llama/generation.py
You can either comment out line 100 or update it to:

    if torch.cuda.is_available():
        torch.set_default_tensor_type(torch.cuda.HalfTensor)

so that half-tensors are not used if you don't have CUDA.
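
The same idea written out in full (a sketch; exact line numbers vary between releases of Meta's reference code) keeps half tensors only when CUDA is available:

    if torch.cuda.is_available():
        torch.set_default_tensor_type(torch.cuda.HalfTensor)
    else:
        # On CPU, keep the default float32 tensors so fp16-only kernels are never hit
        torch.set_default_tensor_type(torch.FloatTensor)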

It would be helpful if you could paste the code that gave you this error.

I was getting the same error when trying the code from this tutorial: https://huggingface.co./blog/llama2

For me, changing torch_dtype from torch.float16 to torch.bfloat16 fixed the issue.
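
For concreteness, the change described above looks like this in the tutorial's pipeline call (a sketch, reusing the tutorial's model id):

import torch
import transformers

# Same call as in the tutorial, with only the dtype changed
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",
    torch_dtype=torch.bfloat16,  # bfloat16 instead of float16
    device_map="auto",
)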

For me, this replicated the issue in Colab:
https://colab.research.google.com/drive/1SDN3rJhyL9EpDWuDVjyE3lJ6hV0Cfdd-?usp=sharing

The error was not resolved by changing torch_dtype from torch.float16 to torch.bfloat16.

This is the code that generated the error. If you have a solution, could you also describe its rationale?

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Same issue here. After changing torch.float16 to torch.float32, the model takes forever to load, consumes 99% of the RAM, and the notebook then crashes. If anyone knows why this is happening and how to solve it, please let me know. Thank you!
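
A rough back-of-envelope sketch of why float32 exhausts RAM: the weights of a 70B-parameter model alone take about four bytes per parameter in float32 (two in float16/bfloat16), before counting activations and the KV cache.

# Weight memory only, ignoring activations and the KV cache
params = 70e9
print(f"float32:          ~{params * 4 / 1e9:.0f} GB")  # ~280 GB
print(f"float16/bfloat16: ~{params * 2 / 1e9:.0f} GB")  # ~140 GB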
