Make it usable for CPU
0%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 2046.97it/s]
e:\1b.env\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
User: Hi
Assistant:
e:\1b.env\lib\site-packages\transformers\generation\utils.py:1510: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example `input_ids = input_ids.to('cuda')` before running `.generate()`.
  warnings.warn(
Exception in thread Thread-3 (generate):
Traceback (most recent call last):
  File "C:\Users\CEDP\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\CEDP\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "e:\1b.env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\transformers\generation\utils.py", line 1622, in generate
    result = self._sample(
  File "e:\1b.env\lib\site-packages\transformers\generation\utils.py", line 2791, in _sample
    outputs = self(
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1208, in forward
    outputs = self.model(
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 974, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\1b.env\lib\site-packages\torch\nn\modules\sparse.py", line 163, in forward
    return F.embedding(
  File "e:\1b.env\lib\site-packages\torch\nn\functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
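
The RuntimeError means the prompt tensors stayed on the CPU while the model weights live on cuda:0. A minimal sketch of the fix, assuming a standard transformers setup (the model id and prompt below are placeholders, not taken from the report): move the tokenized inputs to the model's device before calling .generate().

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model you are actually loading.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("User: Hi\nAssistant:", return_tensors="pt")
# Move every input tensor to the model's device before generating;
# this is exactly what the UserWarning above is asking for.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```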
Hi! Yeah, only the GPU runtime is supported. CPU would run very slowly with the current implementation, and we focus on GPU because the library is also intended to be used for training.
You can use a free GPU on Google Colab if you don't have access to a GPU-powered machine!
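
If you'd rather fail fast than hit the device-mismatch traceback mid-generation, a small guard can check for a GPU up front. This is just a sketch, not part of the library:

```python
import torch

# The current implementation assumes a CUDA device, so bail out early
# with a clear hint instead of a device-mismatch RuntimeError later on.
if not torch.cuda.is_available():
    raise SystemExit(
        "No CUDA device found. This library currently supports GPU runtimes only; "
        "a free GPU on Google Colab works if you don't have one locally."
    )
```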