bf16 is only supported on A100+ GPUs
NotImplementedError Traceback (most recent call last)
Cell In[2], line 14
11 gen_kwargs = {"max_length": 2048, "do_sample": False}
13 with torch.no_grad():
---> 14 outputs = model.generate(**inputs, **gen_kwargs)
15 outputs = outputs[:, inputs['input_ids'].shape[1]:]
16 print(tokenizer.decode(outputs[0]))
File ~/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
---> 115 return func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/transformers/generation/utils.py:1718, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1701 return self.assisted_decoding(
1702 input_ids,
1703 assistant_model=assistant_model,
(...)
1714 **model_kwargs,
1715 )
1716 if generation_mode == GenerationMode.GREEDY_SEARCH:
1717 # 11. run greedy search
---> 1718 return self.greedy_search(
...
max(query.shape[-1] != value.shape[-1]) > 32
dtype=torch.bfloat16 (supported: {torch.float32})
has custom scale
bf16 is only supported on A100+ GPUs
unsupported embed per head: 112
You can also run inference in FP16; see our GitHub repository for details.
Note that inference with this model in FP16 requires at least 36 GB of GPU memory.
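For anyone hitting the same error on a non-A100 card, here is a minimal sketch of running the snippet from the traceback in FP16 instead of bf16. The model ID below is a placeholder for the checkpoint in the repository, and the prompt handling is simplified:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- substitute the checkpoint from the repository.
MODEL_ID = "your-org/your-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # load weights in FP16 instead of bf16
        trust_remote_code=True,
    )
    .to("cuda")
    .eval()
)

inputs = tokenizer("Describe the image.", return_tensors="pt").to("cuda")
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens, keeping only the newly generated part.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(outputs[0]))
```

As noted above, even in FP16 this model needs at least 36 GB of GPU memory, so make sure that much VRAM is available before running the snippet.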