Note to others trying to run this
The HF version still requires at least 40 GB of VRAM, and my attempts so far to split it across two 3090s have failed.
There's also no requirements file, so you're left guessing which versions of pytorch, einops, transformers, and sentencepiece to use.
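In the meantime, here is a small sketch (my own suggestion, not something from the repo) that records which versions of the packages mentioned above are actually installed, so working combinations can be reported:

import importlib.metadata as md

# Print the installed versions of the packages named above; handy when sharing
# a combination that works, since the repo ships no requirements file.
for pkg in ["torch", "einops", "transformers", "sentencepiece"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")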
Yes, you are right. The Hugging Face version does not support model parallelism; we suggest using the official SAT version: https://github.com/THUDM/CogVLM
If you have the time, please take a look at this issue; it is the main one keeping dual-GPU users from running CogVLM on WSL2: https://github.com/THUDM/CogVLM/issues/56
It seems to be a problem with WSL2 and torch multi-GPU support... I'm not sure, sorry.
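One quick sanity check (a suggestion on my side, not something verified in this thread) is to confirm that torch actually sees both GPUs under WSL2 before debugging CogVLM itself:

import torch

# Verify the WSL2 / torch multi-GPU setup: both 3090s should show up here.
print(torch.cuda.is_available())   # expected: True
print(torch.cuda.device_count())   # expected: 2 on a dual-3090 machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))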
If you have two 24 GB devices, you can use accelerate to dispatch the model as demonstrated below. It seems that the load_checkpoint_and_dispatch function does not support remote Hugging Face model paths like 'THUDM/cogvlm-chat-hf'; a local path to the model checkpoint is needed. I have personally tested this code on my own device and observed that peak GPU usage reached approximately 22 GB.
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# build the model skeleton without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

# plan the split: ~20 GiB per GPU, spill the rest to CPU, never split a decoder layer
device_map = infer_auto_device_map(
    model,
    max_memory={0: '20GiB', 1: '20GiB', 'cpu': '16GiB'},
    no_split_module_classes=['CogVLMDecoderLayer'],
)
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',  # typically '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# check device for weights if you want to
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
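If you are unsure what the local checkpoint path should be, one option (assuming huggingface_hub is installed) is to let snapshot_download resolve it; it returns the local snapshot directory and downloads the weights first if they are not cached yet:

from huggingface_hub import snapshot_download

# Resolve (and, if needed, download) the local snapshot of the HF checkpoint,
# then pass the returned directory to load_checkpoint_and_dispatch above.
local_path = snapshot_download('THUDM/cogvlm-chat-hf')
print(local_path)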
Also, thanks for the reminder. The requirements have been added to the README.
This works in WSL2 with two GPUs, thank you!
CogVLM is the best captioner out there and to finally get this to run is a great relief.
(And, I see you've already added this as an example, great work ^^ )
Has anyone tried to deploy CogVLM (4-bit quantization) on multiple GPUs with accelerate?
@2thousand see if this can help
Thanks, I just figured it out. We can directly add device_map="auto" in AutoModelForCausalLM.from_pretrained():
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('vicuna-7b-v1.5')  # local path, or 'lmsys/vicuna-7b-v1.5' as above
# load_in_4bit quantizes the weights; device_map="auto" lets accelerate spread them across the GPUs
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

query = 'Describe this image in details.'
image = Image.open('image-path').convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
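To see how accelerate actually split the quantized model across the GPUs, you can inspect the device map that from_pretrained stores on the model (just a quick check, not required):

# hf_device_map is populated by transformers/accelerate when device_map is used;
# it maps module names to the device each one was placed on.
print(model.hf_device_map)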
Can someone create a web demo version of this? I tried adapting the CogVLM web demo with the accelerate code above to enable multi-GPU support in WSL2, but couldn't get it to work.
Has anyone gotten a Gradio UI version of CogVLM working in WSL2?
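Not a finished demo, but here is a minimal sketch of a Gradio wrapper around the 4-bit snippet above (assumptions: gradio is installed, the same model/tokenizer setup as in that snippet; this is not the repository's web_demo and is untested on WSL2):

import torch
import gradio as gr
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Same 4-bit multi-GPU setup as in the snippet above.
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

def caption(image, query):
    # Reuse the single-turn chat pipeline from the 4-bit example above.
    inputs = model.build_conversation_input_ids(
        tokenizer, query=query, history=[], images=[image.convert('RGB')]
    )
    inputs = {
        'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
        'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
        'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
        'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, do_sample=False)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(
    fn=caption,
    inputs=[gr.Image(type="pil"), gr.Textbox(value="Describe this image in details.")],
    outputs="text",
).launch()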