CUDA error: device-side assert triggered

#5
by bg90 - opened

The meditron-7b model loads in text-generation-webui, but when I enter a prompt and press "generate", I get this error:

  File "/home/me/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/home/me/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered

(run with CUDA_LAUNCH_BLOCKING=1)

From what I read, these device-side asserts are often triggered by an invalid indexing operation.
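A minimal sketch of how the hint in the error message can be applied from Python, assuming the model is loaded from a script you control rather than from the webui launcher. Setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback points at the operation that actually failed. The variable has to be set before CUDA is initialised, which is why the torch import is left as a comment at the end.

```python
import os

# Must be set before CUDA is initialised, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print(f"CUDA_LAUNCH_BLOCKING={launch_blocking}")

# import torch  # import torch and load the model only after this point
```

From a shell, the equivalent is to prefix the launch command with CUDA_LAUNCH_BLOCKING=1.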

I am not sure whether this is a problem with the model or with my settings (many models, such as llama2 and mistral variants, work well on my machine).

I have encountered a similar problem as well. I believe this is because the tokenizer has a vocabulary size of 32019 while the model's embedding layer only has 32000 rows, which results in an out-of-bounds memory access. Is there any way to fix this?
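One way to confirm this kind of mismatch before the model is called is to look for token ids that the embedding table cannot hold. The sketch below is plain Python; the helper name find_out_of_range_ids and the sample ids are hypothetical, and only the table size of 32000 comes from this thread.

```python
def find_out_of_range_ids(input_ids, num_embeddings):
    """Return the sorted unique ids that a table with num_embeddings rows cannot look up."""
    bad = set()
    for row in input_ids:
        for token_id in row:
            if token_id < 0 or token_id >= num_embeddings:
                bad.add(token_id)
    return sorted(bad)


# Hypothetical batch: one sequence whose last two ids lie beyond the table.
batch = [[1, 910, 338, 32004, 32004]]
bad_ids = find_out_of_range_ids(batch, 32000)
print(bad_ids)  # [32004]
```

With a real tokenizer and model, the same check could be run on the tokenizer output converted to nested lists, with the table size read from model.get_input_embeddings().num_embeddings.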

I also get the same issue when I try to decode (generate text) with meditron-70B.

It looks like the versions uploaded so far are only the base models without any type of instruction finetuning or chat finetuning.

In the 70B model card (https://huggingface.co/epfl-llm/meditron-70b#downstream-use) they've specified "Note 1: The above formatting is not required for running the base model (this repository)". Maybe they will upload finetuned versions in the future, considering they also write "Future versions of the tuned models will be released as we enhance model's performance."

It's all a bit strange, though, considering they recommend the deployment guide on GitHub (https://github.com/epfLLM/meditron/blob/main/deployment/README.md) for how to use the base model: "To run proper generation with this base model, we recommend using a high-throughput and memory-efficient inference engine, such as vLLM, with a UI that supports chat and text generation, such as BetterChatGPT. To see more details about model deployment and generation, please see our documentation." But the deployment guide on GitHub assumes the model is already instruction finetuned (and has the <|im_start|> and <|im_end|> tokens).

EPFL LLM Team org

Hi there,
Thank you for bringing this to our attention.
There was indeed a size mismatch between the tokenizer and the model embedding. We uploaded an updated version of the tokenizer along with its configurations. Let us know if this resolves the issue. We appreciate your feedback!

Regarding the confusion with the downstream-use instructions, we want to clarify that the models we uploaded (7B & 70B) are pretrained versions without additional finetuning or instruction-tuning. Therefore, the specified format with <|im_start|> and <|im_end|> was not intended for use with these models.

EPFL LLM Team org

We've updated our deployment document to reflect this better and provide relevant examples for the pretrained models. We're always looking to improve, so your suggestions for enhancing our documentation are most welcome.

Looking forward to your feedback!

Thanks for your update, but I still seem to have this issue, as shown below:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b", token=MY_TOKEN)
>>> model = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-7b", token=MY_TOKEN)
>>> output = tokenizer(["This is a chest X-ray of a patient with Cardiomegaly"], return_tensors="pt", truncation=True, padding="max_length", max_length=64,)
>>> model(**output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.conda/envs/clip/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "~/.conda/envs/clip/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "~/.conda/envs/clip/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self

It seems that the tokenizer vocabulary, which is indeed smaller than before, is still larger than the embedding layer of the model:

>>> len(tokenizer)
32005
>>> model.model.embed_tokens
Embedding(32000, 4096)

When I ask the tokenizer to pad the input, it uses the padding token with id 32004, which results in the out-of-range error above. If I skip padding, the code executes properly. Alternatively, if I set tokenizer.pad_token = tokenizer.eos_token, the code also runs properly.
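The effect of swapping the padding token can be illustrated without loading the model. The sketch below mimics padding="max_length" on a short list of ids and checks the result against the table size. The helper pad_to_length and the prompt ids are hypothetical, and EOS_ID = 2 is an assumption based on the usual LLaMA convention; check tokenizer.eos_token_id for the real value. The padding side does not matter for this check.

```python
def pad_to_length(token_ids, max_length, pad_id):
    """Right-pad token_ids with pad_id up to max_length."""
    return token_ids + [pad_id] * max(0, max_length - len(token_ids))


NUM_EMBEDDINGS = 32000  # rows in model.model.embed_tokens, from the thread
PAD_ID = 32004          # id of '<PAD>', from the thread
EOS_ID = 2              # assumption: LLaMA-style eos id

ids = [1, 910, 338]     # hypothetical ids for a short prompt
with_pad = pad_to_length(ids, 8, PAD_ID)
with_eos = pad_to_length(ids, 8, EOS_ID)

# Every id must satisfy 0 <= id < NUM_EMBEDDINGS for the lookup to succeed.
pad_ok = all(0 <= i < NUM_EMBEDDINGS for i in with_pad)
eos_ok = all(0 <= i < NUM_EMBEDDINGS for i in with_eos)
print(pad_ok, eos_ok)  # False True
```

Padding with the eos token works only because its id is already inside the embedding table; the attention mask still has to mark the padded positions so they are ignored.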

It seems that those five extra tokens are all special tokens:

>>> idxtoword = {v: k for k, v in tokenizer.get_vocab().items()}
>>> idxtoword[32000]
'<CLS>'
>>> idxtoword[32001]
'<SEP>'
>>> idxtoword[32002]
'<EOD>'
>>> idxtoword[32003]
'<MASK>'
>>> idxtoword[32004]
'<PAD>'

If I understand correctly, these tokens are not used during training (since LLaMA doesn't use those tokens either). So I think you may be able to get around this issue by simply removing these tokens.
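As a rough sketch of how the tokens to remove could be identified programmatically, the snippet below splits a vocabulary into entries the embedding table can serve and entries beyond it. The five extra tokens and their ids are the ones listed above; the helper split_vocab and the three in-range entries are hypothetical stand-ins for the full vocabulary.

```python
# Extra tokens and ids as reported in this thread.
extra_tokens = {"<CLS>": 32000, "<SEP>": 32001, "<EOD>": 32002,
                "<MASK>": 32003, "<PAD>": 32004}
# Assumption: a few in-range entries standing in for the real 32000-token vocabulary.
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, **extra_tokens}


def split_vocab(vocab, num_embeddings):
    """Split a token->id mapping into entries inside and outside the embedding table."""
    kept = {tok: i for tok, i in vocab.items() if i < num_embeddings}
    extra = {tok: i for tok, i in vocab.items() if i >= num_embeddings}
    return kept, extra


kept, to_remove = split_vocab(vocab, 32000)
print(sorted(to_remove, key=to_remove.get))
# ['<CLS>', '<SEP>', '<EOD>', '<MASK>', '<PAD>']
```

With the real objects, the inputs would be tokenizer.get_vocab() and the row count of the model's input embedding.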

@zechen-nlp, thank you for your answer.
I have downloaded the new files, and now I can run inference using oobabooga/text-generation-webui.
Many thanks.

Same as above: the tokenizer is still adding special tokens that the base model does not expect.
