load_in_8bit error

#2
by BBLL3456 - opened

I could load Baichuan version 1 in 8-bit but cannot load version 2; it fails with the following error:

ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co./docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
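
The error message points to CPU offload as a possible workaround. Below is a minimal, untested sketch of what that could look like. It makes several assumptions that do not come from this thread: the current `BitsAndBytesConfig` spelling of the flag (`llm_int8_enable_fp32_cpu_offload`), a count of 40 decoder layers for Baichuan2-13B, and LLaMA-style submodule names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`). `build_device_map` and `load_8bit_with_offload` are hypothetical helpers; verify the module names against modeling_baichuan.py before relying on them.

```python
def build_device_map(num_layers: int, gpu_layers: int) -> dict:
    """Keep the first `gpu_layers` decoder layers on GPU 0 and offload the rest to CPU."""
    if not 0 <= gpu_layers <= num_layers:
        raise ValueError("gpu_layers must be between 0 and num_layers")
    # Assumed module names (LLaMA-style layout); not confirmed for Baichuan2.
    device_map = {"model.embed_tokens": 0, "model.norm": 0, "lm_head": 0}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = 0 if i < gpu_layers else "cpu"
    return device_map


def load_8bit_with_offload(path: str, gpu_layers: int):
    # Imported lazily so build_device_map can be used without transformers installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        # Modules placed on the CPU stay unquantized instead of being converted to 8-bit.
        llm_int8_enable_fp32_cpu_offload=True,
    )
    return AutoModelForCausalLM.from_pretrained(
        path,
        # 40 layers is an assumption about Baichuan2-13B, not a value from this thread.
        device_map=build_device_map(num_layers=40, gpu_layers=gpu_layers),
        quantization_config=quant_config,
        trust_remote_code=True,
    )
```

Layers offloaded this way run on the CPU, so generation would likely be much slower than a fully GPU-resident model.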

Baichuan Intelligent Technology org

Can you post your code?

I used web_demo.py from GitHub and just added load_in_8bit=True. I can load version 2 with load_in_4bit.

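import torch
from transformers import AutoModelForCausalLM

def init_model():
    model = AutoModelForCausalLM.from_pretrained(
        "./model/Baichuan2-13B-Chat",
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=True,  # the only change made to web_demo.py
        trust_remote_code=True,
    )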

Baichuan Intelligent Technology org

I cannot reproduce your error. Did you pull the latest code?

I ran into the same problem when doing the int8 conversion:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Baichuan2-13B-Chat', load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained('Baichuan2-13B-Chat-int8')

File ~/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py:537, in BaichuanForCausalLM.__init__(self, config, *model_args, **model_kwargs)
    535 self.model = BaichuanModel(config)
    536 self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
    538     try:
    539         from .quantizer import quantize_offline, init_model_weight_int4

TypeError: 'BitsAndBytesConfig' object is not subscriptable

I cannot reproduce your error. Did you pull the latest code?

Yes, it is the latest, including the change for 'BitsAndBytesConfig' object is not subscriptable.

I am not sure if it makes a difference, but I downloaded the files locally and put them in the ./model folder.

@XuWave you need to download the latest modeling_baichuan.py.
But 8-bit loading would still fail after that; 4-bit loading works.

I think somehow this version may be taking much more memory to load the 8 bit than the Baichuan version 1. If you could confirm that, then it could be a memory issue.

@BBLL3456 90 GB of RAM, 32 GB of GPU memory.

Baichuan Intelligent Technology org

For int8, 13B-Chat will cost 14.2GiB memory.

Baichuan Intelligent Technology org

The code in your traceback is not the latest. Line 537,

    if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:

has been changed to:

    if hasattr(config, "quantization_config") and isinstance(config.quantization_config, dict) and config.quantization_config.get('load_in_4bit', False):
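
A small self-contained sketch of why the added `isinstance` check avoids the `TypeError`. It assumes `quantization_config` can arrive either as a plain dict or as an object such as `BitsAndBytesConfig`, which does not support `[...]` indexing. `FakeQuantConfig`, `old_check`, and `new_check` are illustrative stand-ins, not names from modeling_baichuan.py.

```python
from types import SimpleNamespace


class FakeQuantConfig:
    """Stand-in for an object-style config such as BitsAndBytesConfig (not subscriptable)."""

    def __init__(self, load_in_4bit: bool = False):
        self.load_in_4bit = load_in_4bit


def old_check(config) -> bool:
    # Indexes the config directly, which only works when it is a dict.
    return hasattr(config, "quantization_config") and config.quantization_config["load_in_4bit"]


def new_check(config) -> bool:
    # isinstance() short-circuits before any indexing when the config is an object.
    return (
        hasattr(config, "quantization_config")
        and isinstance(config.quantization_config, dict)
        and config.quantization_config.get("load_in_4bit", False)
    )


dict_style = SimpleNamespace(quantization_config={"load_in_4bit": True})
object_style = SimpleNamespace(quantization_config=FakeQuantConfig())

print(new_check(dict_style))
print(new_check(object_style))

try:
    old_check(object_style)
except TypeError as exc:
    print("old check fails:", exc)
```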

I have 16GB GPU 32GB RAM - can't load 8bit version 2.

Yes, I know; I was just replying to @XuWave. I am using the latest code. As I said, I can run 4-bit with no issue, and I am pretty sure the 13B v2 uses more GPU memory than v1. Could you please compare the code of V1 and V2? I am using the same environment for both.

Baichuan Intelligent Technology org

Yes, V2 uses more memory than V1, mainly for the following reasons:

  1. the vocabulary is 2x the size of V1's;
  2. the quantizer is a mixed-precision 8-bit quantization op.

If we also take GPU memory fragmentation into account, it is possible that 15 GB is not enough.
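
A rough back-of-the-envelope sketch of the first point. The figures are assumptions not stated in this thread: vocabulary sizes of 64,000 (V1) and 125,696 (V2), a hidden size of 5,120, and both the input embedding and the output head staying in fp16 (2 bytes per weight) rather than being quantized to 8-bit.

```python
def fp16_gib(vocab_size: int, hidden_size: int, n_matrices: int = 2) -> float:
    """Memory in GiB for `n_matrices` fp16 matrices of shape (vocab_size, hidden_size)."""
    bytes_per_weight = 2  # fp16
    return n_matrices * vocab_size * hidden_size * bytes_per_weight / 2**30


HIDDEN = 5120                   # assumed hidden size for both 13B models
v1 = fp16_gib(64_000, HIDDEN)   # assumed V1 vocabulary size
v2 = fp16_gib(125_696, HIDDEN)  # assumed V2 vocabulary size
print(f"V1: {v1:.2f} GiB, V2: {v2:.2f} GiB, extra: {v2 - v1:.2f} GiB")
```

Under those assumptions the larger vocabulary alone would add a bit more than 1 GiB, before counting any quantizer overhead or fragmentation.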

Sad, I can't run Baichuan2 8bit on my machine then...

[Attached image: GPU Baichuan2.png]

Well, according to your GitHub page, it is supposed to be more efficient than version 1, requiring only 14.2 GB as opposed to 15.8 GB for version 1. By rights, I should be able to load it on my machine.

Baichuan Intelligent Technology org

I have no idea. On my machine, the memory usage is about 15241971712 bytes / 2**30 = 14.2 GiB for 8-bit loading.

Baichuan Intelligent Technology org

Just now I tested the 13B int8 GPU memory usage: nvidia-smi shows 16.05 GB, while torch.cuda.max_memory_allocated() reports 14.2 GB. So is there other memory used by the model that torch cannot see?
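
One way to narrow down the gap between the two numbers is to compare what PyTorch's allocator reports with what the driver reports. The sketch below assumes a single CUDA device and is meant to run in the same process right after the model is loaded; `gib` and `memory_gap` are hypothetical helpers. Memory that the caching allocator has reserved but not handed out, plus the CUDA context and anything allocated outside PyTorch's allocator, is counted by nvidia-smi but not by `torch.cuda.memory_allocated()`.

```python
def gib(n_bytes: int) -> float:
    return n_bytes / 2**30


def memory_gap(allocated: int, reserved: int, device_used: int) -> dict:
    """Split device memory usage (in bytes) into what torch tracks and what it does not."""
    return {
        "allocated_gib": gib(allocated),                   # live tensors
        "cached_gib": gib(reserved - allocated),           # reserved by the caching allocator, unused
        "outside_torch_gib": gib(device_used - reserved),  # CUDA context, other allocations or processes
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # device-wide numbers from the driver
        report = memory_gap(
            torch.cuda.memory_allocated(),
            torch.cuda.memory_reserved(),
            total - free,
        )
        for name, value in report.items():
            print(f"{name}: {value:.2f}")
    else:
        print("No CUDA device available; nothing to measure.")
```

Note that `total - free` is device-wide, so other processes on the same GPU would also show up in the last figure.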

Baichuan Intelligent Technology org

I guess some ops use additional memory. I have no idea how to solve it.

Is the model first loaded in fp32 instead of fp16? The error is raised during the initial loading.

I also saw some discussions on this same issue raised on your Github page.

I am pasting the entire error below:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 311, in _handle_cache_miss
    cached_result = cache.read_result(value_key)
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/home/user/baichuan2/web_demo.py", line 72, in <module>
    main()
  File "/home/user/baichuan2/web_demo.py", line 51, in main
    model, tokenizer = init_model()
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
    return cached_func(*args, **kwargs)
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 240, in __call__
    return self._get_or_create_cached_value(args, kwargs)
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
  File "/home/user/baichuan2/web_demo.py", line 13, in init_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/user/.cache/huggingface/modules/transformers_modules/baichuan-inc/Baichuan2-13B-Chat/670d17ee403f45334f53121d72feff623cc37de1/modeling_baichuan.py", line 669, in from_pretrained
    return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
  File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3114, in from_pretrained
    raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co./docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

Baichuan Intelligent Technology org

Not really.

Would you be able to provide an int-8 bit version?

Baichuan Intelligent Technology org

We have no plans to provide an int8 version for now.

BBLL3456 changed discussion status to closed
