load_in_8bit error
I can load Baichuan version 1 in 8-bit but cannot load version 2; it fails with the following error:
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co./docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
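For readers hitting the same message, here is a minimal sketch of the "custom device_map plus fp32 CPU offload" route the error describes. It is untested on Baichuan2, whose remote code overrides from_pretrained, so treat it as an illustration only. The module names (model.embed_tokens, model.layers.N, model.norm, lm_head), the layer count of 40, and the helper names are assumptions; the documentation page linked in the error describes the option as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig.

```python
def build_device_map(num_layers: int, gpu_layers: int, gpu: int = 0) -> dict:
    """Place the first `gpu_layers` decoder layers on the GPU, the rest on the CPU."""
    if not 0 <= gpu_layers <= num_layers:
        raise ValueError("gpu_layers must be between 0 and num_layers")
    # Module names are assumptions; check them against model.named_modules().
    device_map = {"model.embed_tokens": gpu}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = gpu if i < gpu_layers else "cpu"
    # Keep the final norm and output head next to the last decoder layer.
    tail = gpu if gpu_layers == num_layers else "cpu"
    device_map["model.norm"] = tail
    device_map["lm_head"] = tail
    return device_map


def load_8bit_with_offload(model_path: str, device_map: dict):
    """Hypothetical helper: 8-bit load with CPU-dispatched modules kept in fp32."""
    # Imported lazily so build_device_map works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map=device_map,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    )


if __name__ == "__main__":
    # 40 layers is an assumption for the 13B model; read the real value from config.json.
    dm = build_device_map(num_layers=40, gpu_layers=32)
    print(dm["model.layers.31"], dm["model.layers.32"], dm["lm_head"])
```

Modules placed on the CPU run in fp32 and are slow, so this trades speed for fitting on a smaller GPU; lowering gpu_layers until loading succeeds is one way to find a workable split.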
Can you post your code?
I used web_demo.py from GitHub and only added load_in_8bit. I can load version 2 with load_in_4bit.
def init_model():
    model = AutoModelForCausalLM.from_pretrained(
        "./model/Baichuan2-13B-Chat",
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=True,  # the only change from the stock web_demo.py
        trust_remote_code=True
    )
    # ... remainder of init_model unchanged
I cannot reproduce your error. Did you pull the latest code?
I ran into the same problem when doing int8 quantization:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('Baichuan2-13B-Chat', load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained('Baichuan2-13B-Chat-int8')
File ~/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py:537, in BaichuanForCausalLM.__init__(self, config, *model_args, **model_kwargs)
535 self.model = BaichuanModel(config)
536 self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
538 try:
539 from .quantizer import quantize_offline, init_model_weight_int4
TypeError: 'BitsAndBytesConfig' object is not subscriptable
Yes, I pulled the latest code, including the fix for "'BitsAndBytesConfig' object is not subscriptable".
I am not sure whether it makes a difference, but I downloaded the files locally and put them in the ./model folder.
@XuWave, you need to download the latest modeling_baichuan.py.
Even then, loading in 8-bit still fails for me, while 4-bit works.
I think somehow this version may be taking much more memory to load the 8 bit than the Baichuan version 1. If you could confirm that, then it could be a memory issue.
For int8, 13B-Chat will cost 14.2GiB memory.
The code in that traceback is not the latest. Line 537,
if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
has been changed to:
if hasattr(config, "quantization_config") and isinstance(config.quantization_config, dict) and config.quantization_config.get('load_in_4bit', False):
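To illustrate why the old check fails and the new one does not, here is a small self-contained sketch. FakeQuantConfig and FakeModelConfig are hypothetical stand-ins, not the real transformers or Baichuan classes; the point is only that an object-style config cannot be subscripted, while the updated guard skips anything that is not a dict.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantConfig:
    """Hypothetical stand-in for an object-style quantization config."""
    load_in_4bit: bool = False
    load_in_8bit: bool = False


class FakeModelConfig:
    """Hypothetical stand-in for the model config passed to __init__."""

    def __init__(self, quantization_config=None):
        if quantization_config is not None:
            self.quantization_config = quantization_config


def wants_offline_4bit(config) -> bool:
    """Mirror the updated guard: only dict-style configs can select the int4 path."""
    return bool(
        hasattr(config, "quantization_config")
        and isinstance(config.quantization_config, dict)
        and config.quantization_config.get("load_in_4bit", False)
    )


if __name__ == "__main__":
    obj_config = FakeModelConfig(FakeQuantConfig(load_in_8bit=True))
    try:
        # The old check subscripted the config object directly.
        obj_config.quantization_config["load_in_4bit"]
    except TypeError as exc:
        print("old check:", exc)
    print("new check:", wants_offline_4bit(obj_config))
```

With the guard in place, an 8-bit load that passes a config object simply falls through to the regular loading path instead of raising.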
I have a 16GB GPU and 32GB of RAM, and I still can't load version 2 in 8-bit.
Yes, I know; I was just replying to @XuWave. I am using the latest code. As I said, I can run 4-bit with no issue, and I am pretty sure 13B V2 uses more GPU memory than V1. Could you please compare the V1 and V2 code? I am using the same environment for both.
Yes, V2 uses more memory than V1, mainly because of the following factors:
- the vocabulary is about twice as large as V1's;
- the quantizer is a mixed-precision 8-bit quantization op.
Once GPU memory fragmentation is taken into account, 15GB may well not be enough.
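As a rough back-of-the-envelope check of the vocabulary factor, the sketch below estimates the size of the two vocabulary-sized matrices (input embedding and output head). The numbers are assumptions, not taken from this thread: vocabulary sizes of 64,000 (V1) and 125,696 (V2), a hidden size of 5,120, and both matrices staying in fp16 rather than being quantized to int8. Check config.json for the real values.

```python
GIB = 2**30


def vocab_matrices_gib(vocab_size: int, hidden_size: int,
                       bytes_per_param: int = 2, num_matrices: int = 2) -> float:
    """Size in GiB of the embedding and output head, assuming fp16 (2 bytes) weights."""
    return vocab_size * hidden_size * bytes_per_param * num_matrices / GIB


# Assumed values; read vocab_size and hidden_size from each model's config.json.
v1 = vocab_matrices_gib(vocab_size=64_000, hidden_size=5_120)
v2 = vocab_matrices_gib(vocab_size=125_696, hidden_size=5_120)

if __name__ == "__main__":
    print(f"V1: {v1:.2f} GiB, V2: {v2:.2f} GiB, extra for V2: {v2 - v1:.2f} GiB")
```

Under these assumptions the larger vocabulary alone accounts for a bit over 1 GiB of extra memory, which matters when the GPU has only 16GB to begin with.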
Sad, I can't run Baichuan2 8bit on my machine then...
I have no idea. On my machine, memory usage for 8-bit loading is about 15241971712 bytes / 2**30 = 14.2GiB.
I just tested GPU memory usage for 13B int8: nvidia-smi shows 16.05GB, while torch.cuda.max_memory_allocated() reports 14.2GB. So is the model using additional memory that torch does not account for?
I guess some ops use additional memory. I have no idea how to solve it.
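One way to narrow down the gap between the two numbers is to compare what PyTorch's allocator reports against what the driver reports. The sketch below is an assumption-laden helper (to_gib and cuda_memory_report are hypothetical names): peak allocated counts only live tensors, reserved also includes blocks cached by the allocator, and the driver-level figure additionally includes the CUDA context and any other process on the device, which is closer to what nvidia-smi shows.

```python
GIB = 2**30


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def cuda_memory_report(device: int = 0):
    """Return memory figures in GiB, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Driver-level view: includes the CUDA context and other processes.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        # Peak memory held by tensors since the start of the program.
        "peak_allocated": to_gib(torch.cuda.max_memory_allocated(device)),
        # Memory held by the caching allocator, including unused cached blocks.
        "reserved": to_gib(torch.cuda.memory_reserved(device)),
        "device_used": to_gib(total_bytes - free_bytes),
    }


if __name__ == "__main__":
    # The figure quoted above, converted from bytes.
    print(round(to_gib(15241971712), 1))
    print(cuda_memory_report())
```

If reserved is much larger than peak_allocated after loading, fragmentation in the caching allocator is a plausible contributor; if device_used is much larger than reserved, the rest is outside PyTorch's allocator.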
Is the model first loaded in fp32 instead of 16-bit? The error is raised during the initial loading.
I also saw some discussions of this same issue on your GitHub page.
I am pasting the entire error below:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 311, in _handle_cache_miss
cached_result = cache.read_result(value_key)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/home/user/baichuan2/web_demo.py", line 72, in <module>
main()
File "/home/user/baichuan2/web_demo.py", line 51, in main
model, tokenizer = init_model()
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
return cached_func(*args, **kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 240, in __call__
return self._get_or_create_cached_value(args, kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
computed_value = self._info.func(*func_args, **func_kwargs)
File "/home/user/baichuan2/web_demo.py", line 13, in init_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/home/user/.cache/huggingface/modules/transformers_modules/baichuan-inc/Baichuan2-13B-Chat/670d17ee403f45334f53121d72feff623cc37de1/modeling_baichuan.py", line 669, in from_pretrained
return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3114, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co./docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
Not really.
Would you be able to provide an int-8 bit version?
We have no plans to provide an int8 version for now.