The `main` branch for TheBloke/Llama-2-70B-GPTQ appears borked
Using the latest oobabooga/text-generation-webui on RunPod. Tried two different GPUs (L40 48 GB and A100 80 GB) with the ExLlama loader.
The model loads successfully (nothing suspicious in the logs) but fails during inference:
```
Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 331, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/workspace/text-generation-webui/modules/exllama.py", line 98, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 186, in gen_begin_reuse
    self.gen_begin(in_tokens)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 171, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 849, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 930, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 470, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 388, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 525, 64, 128]' is invalid for input of size 537600
```
Interestingly enough, a very small prompt (like 'Hello') works.
Tried other loaders, similar issues. Tried Llama 2 13B, and it worked.
Tried the `gptq-4bit-64g-actorder_True` quantization on the A100, same error. All settings are default. My steps are literally: start the pod, download the model, load it, try to generate.
Same error here on an A100 80GB.
There's an architecture change with 70B. It's an ExLlama and AutoGPTQ issue.
Do you mean there is a difference between 13B and 70B (the former works fine)?
In that case the usage instructions and compatibility info should be updated:
https://huggingface.co./TheBloke/Llama-2-70B-GPTQ#how-to-easily-download-and-use-this-model-in-text-generation-webui
Same issue on 2xA6000.
This is because the number of key and value heads in attention for Llama 70B (`num_key_value_heads` in the `config.json` of the model uploaded by Meta) is different from `num_attention_heads`. That's why Transformers has a new function named `repeat_kv` to accommodate this. ExLlama and GPTQ haven't implemented it yet.
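For reference, the 70B `config.json` has `num_attention_heads = 64` but `num_key_value_heads = 8`. A minimal sketch of what `repeat_kv` does (close in spirit to the Transformers implementation, shown here only for illustration):

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq_len, head_dim) key/value states to
    (batch, num_kv_heads * n_rep, seq_len, head_dim) so that each group of
    query heads attends over its shared key/value head."""
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# Llama-2-70B: 64 query heads share 8 key/value heads, so n_rep = 64 // 8 = 8.
```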
Same here on an A100 80gb.
Yes, you need to update Transformers to the latest version. I should have mentioned that in the README, but it was already 4am and I forgot.
Please run:
```
pip3 install git+https://github.com/huggingface/transformers
```
and try again.
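A quick sanity check that the dev build is actually the one being picked up (assuming the install above succeeded) is to print the version:

```python
import transformers

# Llama 2 support (including grouped-query attention for 70B) landed in
# Transformers 4.31; the git install above should report 4.31+ or a dev build.
print(transformers.__version__)
```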
There is an architectural change to 70b, yes. They added grouped-query attention, which needs to be added to ExLlama. It's not a big change, though, and I'm on it, so be patient. Downloading all these models takes a while. And yes, 7b and 13b don't have this change.
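For anyone wondering what the change looks like dimensionally: with grouped-query attention the key/value projections emit `num_key_value_heads * head_dim` features instead of `hidden_size`, which is exactly why the `view(..., num_attention_heads, head_dim)` in the traceback fails. A rough sketch with made-up variable names (not ExLlama's actual code):

```python
import torch
import torch.nn as nn

hidden_size, num_heads, num_kv_heads = 8192, 64, 8    # Llama-2-70B config values
head_dim = hidden_size // num_heads                   # 128

# With GQA the key/value projections are much narrower than the query projection:
q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)     # 8192 -> 8192
k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)  # 8192 -> 1024

x = torch.randn(1, 525, hidden_size)
k = k_proj(x)                               # 1 * 525 * 8 * 128 = 537,600 elements
k = k.view(1, 525, num_kv_heads, head_dim)  # OK
# k.view(1, 525, num_heads, head_dim)       # would raise the RuntimeError above
```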
Great, looking forward to it! GfL and AutoGPTQ are slow as shit with this ;)
These turnaround times are amazing, guys. It looks like Llama 2 support was added to ExLlama. What's that, 24 hours since the OG model dropped?
Awesome! Can confirm that after updating text-generation-webui and updating the pip deps, the ExLlama loader worked! Thanks, everyone!