bf16 is only supported on A100+ GPUs
NotImplementedError Traceback (most recent call last)
Cell In[2], line 14
11 gen_kwargs = {"max_length": 2048, "do_sample": False}
13 with torch.no_grad():
---> 14 outputs = model.generate(**inputs, **gen_kwargs)
15 outputs = outputs[:, inputs['input_ids'].shape[1]:]
16 print(tokenizer.decode(outputs[0]))
File ~/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
---> 115 return func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/transformers/generation/utils.py:1718, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1701 return self.assisted_decoding(
1702 input_ids,
1703 assistant_model=assistant_model,
(...)
1714 **model_kwargs,
1715 )
1716 if generation_mode == GenerationMode.GREEDY_SEARCH:
1717 # 11. run greedy search
---> 1718 return self.greedy_search(
...
max(query.shape[-1] != value.shape[-1]) > 32
dtype=torch.bfloat16 (supported: {torch.float32})
has custom scale
bf16 is only supported on A100+ GPUs
unsupported embed per head: 112
You can also run inference in FP16; see our GitHub repository for details.
Note that inference with this model in FP16 requires at least 36 GB of GPU memory.
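For anyone hitting the same error on a non-A100 card, here is a minimal sketch of running the snippet from the traceback in FP16 instead of bf16. The model ID below is a placeholder for the checkpoint in the repository, and the prompt handling is simplified:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- substitute the checkpoint from the repository.
MODEL_ID = "your-org/your-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # load weights in FP16 instead of bf16
        trust_remote_code=True,
    )
    .to("cuda")
    .eval()
)

inputs = tokenizer("Describe the image.", return_tensors="pt").to("cuda")
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens, keeping only the newly generated part.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(outputs[0]))
```

As noted above, even in FP16 this model needs at least 36 GB of GPU memory, so make sure that much VRAM is available before running the snippet.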