Zero GPU does not support 4-bit quantization with bitsandbytes?
Hi there, I just tried to deploy an LLM, 0-roleplay, in a new ZeroGPU Space.
The Space built successfully, but I encountered the following errors when trying to run it.
UPDATE: Found a similar issue here, but the provided solution does not seem to work for me.
UPDATE AGAIN: Successfully ran the Space by calling AutoModelForCausalLM.from_pretrained() inside the method decorated with @spaces.GPU. But I'm still wondering why this happens.
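For context, here is roughly what the app does (a simplified reconstruction, not the exact code; the chat handling and generation settings are placeholders):

```python
import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint already ships its own 4-bit BitsAndBytesConfig, so no explicit
# quantization_config is passed here; the model is loaded at import time,
# outside any @spaces.GPU function.
model = AutoModelForCausalLM.from_pretrained(
    "Rorical/0-roleplay", return_dict=True, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Rorical/0-roleplay", trust_remote_code=True)

@spaces.GPU
def response(message, history):
    # When ZeroGPU attaches a GPU for this call, it tries to move the already
    # quantized (torch.uint8) weights onto it, which is where the ValueError
    # in the log below is raised.
    inputs = tokenizer(message, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```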
===== Application Startup at 2024-06-10 17:21:51 =====
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
↑ Those bitsandbytes warnings are expected on ZeroGPU ↑
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/usr/local/lib/python3.10/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading shards:  50%|█████     | 1/2 [00:04<00:04,  4.08s/it]
Downloading shards: 100%|██████████| 2/2 [00:10<00:00,  5.48s/it]
Downloading shards: 100%|██████████| 2/2 [00:10<00:00,  5.27s/it]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.96it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 116, in worker_init
torch.move(nvidia_uuid)
File "/usr/local/lib/python3.10/site-packages/spaces/zero/torch.py", line 254, in _move
bitsandbytes.move()
File "/usr/local/lib/python3.10/site-packages/spaces/zero/bitsandbytes.py", line 120, in _move
tensor.data = _param_to_4bit(tensor,
File "/usr/local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
return self._quantize(device)
File "/usr/local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(
File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 532, in process_events
response = await route_utils.call_process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1928, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1512, in call_function
prediction = await fn(*processed_input)
File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 799, in async_wrapper
response = await f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/gradio/chat_interface.py", line 546, in _submit_fn
response = await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 177, in gradio_handler
raise res.value
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
@tanyuzhou interesting. I think what it's trying to do is to somehow quantize an already int8 weight? Can you use quanto instead? AFAIK it's more up-to-date by means of maintenance/transformers compatibility https://huggingface.co./docs/transformers/main/en/quantization/quanto
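For reference, the transformers integration looks roughly like this (an untested sketch; the base checkpoint name and weight dtype are placeholders, since quanto quantizes a full-precision model on the fly):

```python
from transformers import AutoModelForCausalLM, QuantoConfig

# quanto quantizes a full-precision checkpoint at load time, so you would point
# it at the unquantized base model (placeholder id below), not at the
# pre-quantized bitsandbytes checkpoint.
quant_config = QuantoConfig(weights="int4")  # "int8", "int4", "int2" are also options
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model-id",  # hypothetical full-precision base checkpoint
    quantization_config=quant_config,
    trust_remote_code=True,
)
```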
Hi @merve, thanks for paying attention to this issue! I don't know much about quantization; is it possible to use quanto on a model that has already been quantized to 4-bit?
BTW, I've worked on this issue over the past few hours, and I think it might be related to the way Hugging Face wraps the GPU inference method with @spaces.GPU.
First, I tried switching my Space to the A10G small hardware, and everything worked fine, so I think the model is quantized properly.
Then, I moved AutoModelForCausalLM.from_pretrained("Rorical/0-roleplay", return_dict=True, trust_remote_code=True) inside the response() method, which is decorated with @spaces.GPU, so that the pretrained model is loaded in a GPU environment. That worked too. (here is the code)
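In short, the working version looks roughly like this (simplified; prompt formatting and generation settings are illustrative):

```python
import gradio as gr
import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Rorical/0-roleplay", trust_remote_code=True)
model = None  # loaded lazily, only once a GPU is attached

@spaces.GPU
def response(message, history):
    global model
    if model is None:
        # Loading inside the @spaces.GPU context means the 4-bit weights are
        # created while a GPU is available, so ZeroGPU does not have to move
        # already-quantized uint8 params afterwards.
        model = AutoModelForCausalLM.from_pretrained(
            "Rorical/0-roleplay", return_dict=True, trust_remote_code=True
        )
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.ChatInterface(response).launch()
```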
But I still think my original code should work, since there is another Space with a 4-bit quantized model that calls AutoModelForCausalLM.from_pretrained() before __main__ and works fine.