Note to others trying to run this
The HF version still requires at least 40 GB of VRAM, and my attempts so far to split it across two 3090s have failed.
There's also no requirements file, so you're left guessing which versions of pytorch, einops, transformers, and sentencepiece to use.
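In the meantime, here is a small sketch (my own suggestion, not something from the repo) that records which versions of the packages mentioned above are actually installed, so working combinations can be reported:

import importlib.metadata as md

# Print the installed versions of the packages named above; handy when sharing
# a combination that works, since the repo ships no requirements file.
for pkg in ["torch", "einops", "transformers", "sentencepiece"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")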
Yes, you are right. The Hugging Face version does not support model parallelism; we suggest using the official SAT version: https://github.com/THUDM/CogVLM
If you have the time, please take a look at this issue; it is the main one keeping dual-GPU users from running CogVLM on WSL2: https://github.com/THUDM/CogVLM/issues/56
It seems to be a problem with WSL2 and torch multi-GPU support... I'm not sure, sorry.
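One quick sanity check (a suggestion on my side, not something verified in this thread) is to confirm that torch actually sees both GPUs under WSL2 before debugging CogVLM itself:

import torch

# Verify the WSL2 / torch multi-GPU setup: both 3090s should show up here.
print(torch.cuda.is_available())   # expected: True
print(torch.cuda.device_count())   # expected: 2 on a dual-3090 machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))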
If you have two 24 GB devices, you can use accelerate to dispatch the model as demonstrated below. It seems that the load_checkpoint_and_dispatch function does not support remote Hugging Face model paths like 'THUDM/cogvlm-chat-hf'; a local path to the model checkpoint is needed. I have personally tested this code on my own device and observed that peak GPU usage reached approximately 22 GB.
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# build the model skeleton without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

# plan the split: ~20 GiB per GPU, spill the rest to CPU, never split a decoder layer
device_map = infer_auto_device_map(
    model,
    max_memory={0: '20GiB', 1: '20GiB', 'cpu': '16GiB'},
    no_split_module_classes=['CogVLMDecoderLayer'],
)
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',  # typically '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# check device for weights if you want to
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
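If you are unsure what the local checkpoint path should be, one option (assuming huggingface_hub is installed) is to let snapshot_download resolve it; it returns the local snapshot directory and downloads the weights first if they are not cached yet:

from huggingface_hub import snapshot_download

# Resolve (and, if needed, download) the local snapshot of the HF checkpoint,
# then pass the returned directory to load_checkpoint_and_dispatch above.
local_path = snapshot_download('THUDM/cogvlm-chat-hf')
print(local_path)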
Also, thanks for the reminder. The requirements have been added to the README.
This works in WSL2 with two GPUs, thank you!
CogVLM is the best captioner out there and to finally get this to run is a great relief.
(And, I see you've already added this as an example, great work ^^ )
Has anyone tried to deploy CogVLM (4-bit quantization) on multiple GPUs with accelerate?
@2thousand see if this can help
Thanks, I just figured it out. We can directly add device_map="auto" in AutoModelForCausalLM.from_pretrained():
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('vicuna-7b-v1.5')  # local path, or 'lmsys/vicuna-7b-v1.5' as above
# load_in_4bit quantizes the weights; device_map="auto" lets accelerate spread them across the GPUs
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

query = 'Describe this image in details.'
image = Image.open('image-path').convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
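To see how accelerate actually split the quantized model across the GPUs, you can inspect the device map that from_pretrained stores on the model (just a quick check, not required):

# hf_device_map is populated by transformers/accelerate when device_map is used;
# it maps module names to the device each one was placed on.
print(model.hf_device_map)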
Can someone create a web demo version of this? I tried adapting the CogVLM web demo with the accelerate code above to enable multi-GPU support in WSL2, but couldn't get it to work.
Has anyone gotten a Gradio UI version of CogVLM working in WSL2?
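Not a finished demo, but here is a minimal sketch of a Gradio wrapper around the 4-bit snippet above (assumptions: gradio is installed, the same model/tokenizer setup as in that snippet; this is not the repository's web_demo and is untested on WSL2):

import torch
import gradio as gr
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Same 4-bit multi-GPU setup as in the snippet above.
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

def caption(image, query):
    # Reuse the single-turn chat pipeline from the 4-bit example above.
    inputs = model.build_conversation_input_ids(
        tokenizer, query=query, history=[], images=[image.convert('RGB')]
    )
    inputs = {
        'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
        'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
        'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
        'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, do_sample=False)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(
    fn=caption,
    inputs=[gr.Image(type="pil"), gr.Textbox(value="Describe this image in details.")],
    outputs="text",
).launch()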