RuntimeError: CUDA error: an illegal memory access was encountered
Before you dismiss this as "something is wrong with your setup":
- I can run other models without getting this
- It happens with multiple loaders (I've tried exllama and rwkv)
- It happens with more than one of your models (I've verified it on this model and its 13B counterpart)
- When I mentioned it on the Oobabooga Discord, another person chimed in that they had the exact same problem with one of your models ("I got that illegal memory access error last night with ExLlama and TheBloke_WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GPTQ").
- My text-generation-webui is fully up to date.
That out of the way: I can run this (great!) model at first on my RTX 3090, but after a dozen or two generations, every request starts returning only:
Exception occurred during processing of request from ('127.0.0.1', 57152)
Traceback (most recent call last):
File "/usr/lib64/python3.10/socketserver.py", line 683, in process_request_thread
self.finish_request(request, client_address)
File "/usr/lib64/python3.10/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib64/python3.10/socketserver.py", line 747, in init
self.handle()
File "/usr/lib64/python3.10/http/server.py", line 433, in handle
self.handle_one_request()
File "/usr/lib64/python3.10/http/server.py", line 421, in handle_one_request
method()
File "/home/user/text-generation-webui/extensions/api/blocking_api.py", line 86, in do_POST
for a in generator:
File "/home/user/text-generation-webui/modules/chat.py", line 317, in generate_chat_reply
for history in chatbot_wrapper(text, history, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
File "/home/user/text-generation-webui/modules/chat.py", line 234, in chatbot_wrapper
for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
File "/home/user/text-generation-webui/modules/text_generation.py", line 23, in generate_reply
for result in _generate_reply(*args, **kwargs):
File "/home/user/text-generation-webui/modules/text_generation.py", line 176, in _generate_reply
clear_torch_cache()
File "/home/user/text-generation-webui/modules/models.py", line 309, in clear_torch_cache
torch.cuda.empty_cache()
File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/memory.py", line 133, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
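Note that, since CUDA reports these errors asynchronously, the fact that it surfaces inside torch.cuda.empty_cache() doesn't necessarily mean that's where the bad access actually happens. One way to pin down the real call site (a sketch only, and it will slow generation down) is to force synchronous kernel launches before torch ever touches the GPU, e.g. at the very top of server.py or via the environment when launching it:

# Sketch: force synchronous CUDA kernel launches so the illegal access is
# raised at the call that actually triggers it, rather than surfacing later
# in torch.cuda.empty_cache(). The variable has to be set before torch
# creates a CUDA context, hence before the import; generation will be slower.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set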
I've tried running it in a number of ways, but for example:
CUDA_VISIBLE_DEVICES=0 python server.py --listen --listen-port 1234 --loader exllama --model "TheBloke_Wizard-Vicuna-30B-Uncensored-GPTQ" --api --verbose
For simplicity I'm calling it with the non-streaming api.py script, only minimally modified (different text provided, and set to make calls repeatedly in a loop).
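Concretely, the loop is roughly the following (just a sketch: the endpoint path, the default API port of 5000, and the payload field names are what I remember from the stock example script, so treat them as assumptions if your copy differs):

import requests

# Rough sketch of the calling loop. The endpoint, port and field names
# follow the stock blocking-API example script (assumed, not verified
# against this exact webui revision).
URL = "http://127.0.0.1:5000/api/v1/generate"  # default blocking-API port

prompt = "Write a short story about a lighthouse keeper."

for i in range(100):  # repeated calls; the failures start after a dozen or two
    payload = {"prompt": prompt, "max_new_tokens": 200}
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    print(i, resp.json()["results"][0]["text"][:80])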
====
Also, and this is off-topic, but: setting a random seed, or setting the seed to -1, seems to make no difference: given a fixed input, it only ever produces the same output. Just an aside in case you have an easy solution.
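On that seed aside: if sampling isn't actually enabled for the request, the seed never comes into play at all, since greedy decoding is deterministic for a fixed prompt. So it may be worth double-checking the sampling fields in the request body (field names again assumed from the stock example script):

# Hypothetical request body, showing only the fields that decide whether
# the seed matters. With do_sample False (or temperature near 0), decoding
# is effectively greedy, so a fixed prompt always yields the same output
# no matter what the seed is set to.
request = {
    "prompt": "fixed test prompt",
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": -1,  # -1 is supposed to mean "pick a new random seed per call"
}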
OK, interesting. I'm confused by you saying you tested it with the RWKV loader, though. Isn't that just for RWKV models? If not, can you link me to info on using that loader for GPTQ models so I can see how to test it?
Otherwise it looks most like an ExLlama-specific problem. I haven't tested this model specifically with ExLlama, though I see that the 13B model is shown as compatible on ExLlama's model compatibility list.
Let me know regarding RWKV so I can understand whether this is exclusive to ExLlama or not.
I don't have experience on the technical side with loaders (they didn't even exist back when I was last using the UI). All I did was specify --loader rwkv, and the model loaded and ran, but I experienced the same problems. Perhaps it just fell back to exllama, and this might be an exllama issue? The other user reporting the same problem was using exllama as well.
OK yeah, I think it would have ignored --loader rwkv, as that loader is only for RWKV models.
Leave it with me and I'll test it myself. If I can re-create it with ExLlama I'll raise an issue on the ExLlama github.
Have you tried --loader autogptq?
Yeah, autogptq didn't run. I'd need to debug what requirements or whatnot it's wanting. I could ask the other user who experienced this problem to do so if you think it'd help.
EDIT: the autogptq error was:
2023-06-20 18:40:41 INFO:Loading settings from settings.json...
2023-06-20 18:40:41 INFO:Loading TheBloke_Wizard-Vicuna-30B-Uncensored-GPTQ...
2023-06-20 18:40:42 INFO:The AutoGPTQ params are: {'model_basename': 'Wizard-Vicuna-30B-Uncensored-GPTQ-4bit.act.order', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
Traceback (most recent call last):
File "/home/user/text-generation-webui/server.py", line 1003, in
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/user/text-generation-webui/modules/models.py", line 65, in load_model
output = load_func_map[loader](model_name)
File "/home/user/text-generation-webui/modules/models.py", line 271, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
File "/home/user/text-generation-webui/modules/AutoGPTQ_loader.py", line 55, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
File "/home/user/.local/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 79, in from_quantized
model_type = check_and_get_model_type(save_dir or model_name_or_path, trust_remote_code)
File "/home/user/.local/lib/python3.10/site-packages/auto_gptq/modeling/_utils.py", line 125, in check_and_get_model_type
raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: llama isn't supported yet.
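From a quick look, that "llama isn't supported yet" error presumably means the installed auto-gptq (or the transformers version it checks against) is too old to register llama as a supported model type, rather than anything being wrong with the model files. A quick way to see what's actually installed (a sketch, assuming everything came from pip under these distribution names):

from importlib.metadata import version

# Print the versions of the packages involved (assumes pip installs).
for pkg in ("auto-gptq", "transformers", "torch"):
    try:
        print(pkg, version(pkg))
    except Exception as exc:  # not installed, or installed some other way
        print(pkg, "->", exc)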
(For the record, I have a viable, though slightly annoying, workaround going: whenever an API call returns an error, my script automatically kills the server, which is running in a while loop, so it starts back up and the script can continue ;) )
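The client side of that workaround looks roughly like this (a sketch only: the pkill pattern, the wait time, and the endpoint are specific to my setup and purely illustrative):

import subprocess
import time
import requests

URL = "http://127.0.0.1:5000/api/v1/generate"  # same assumed endpoint as above

def generate(prompt):
    # One attempt; on any HTTP or connection error, kill the server process.
    # server.py itself runs under a shell `while true; do ...; done` wrapper,
    # so killing it makes the wrapper start a fresh instance automatically.
    try:
        resp = requests.post(URL, json={"prompt": prompt, "max_new_tokens": 200}, timeout=600)
        resp.raise_for_status()
        return resp.json()["results"][0]["text"]
    except (requests.RequestException, KeyError):
        subprocess.run(["pkill", "-f", "server.py"])  # illustrative pattern
        time.sleep(120)  # crude wait for the model to finish reloading
        return None  # caller retries this prompt on the next pass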