blocky blocky blocky
This is probably not the GGUF's fault, or anyone's, but I'm running into this "blocky blocky blocky" issue in oobabooga and can't test the unquantized model.
The model seems to run fine in LM Studio, so I assume oobabooga just needs to update something. I just wanted to know whether others are running into this too; if so, I can suggest LM Studio for now.
It's probably a missing update, but I also think you need to avoid CUDA offloading for now.
I didn't do any offloading to the CPU, if that's what you mean?
I tested an exl2 quantization at 4 bpw and it worked perfectly, so I think it's probably something on the Text Generation WebUI side, like an outdated library (llama.cpp or similar).
No, you want to do no offloading to the GPU, i.e. leave everything on your CPU; see the sketch below.
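In case it's unclear, here's a minimal sketch of what that means, assuming you load the GGUF with llama-cpp-python directly (which, as far as I know, is what the WebUI's llama.cpp loader is built on). The model path is just a placeholder:

```python
from llama_cpp import Llama

# n_gpu_layers=0 keeps every layer on the CPU, i.e. no CUDA offloading at all.
# The model path is a placeholder; point it at your actual GGUF file.
llm = Llama(model_path="model.Q4_K_M.gguf", n_gpu_layers=0)

out = llm("Hello", max_tokens=32)
print(out["choices"][0]["text"])
```

In the WebUI itself, this corresponds to setting the n-gpu-layers slider to 0 before loading the model.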
The bug can show up in exl2 as well, but I'm not sure why it doesn't always appear.
You can also enable flash attention for llama.cpp, which should work around the issue; example below.
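For anyone loading the model from a script, this is the setting I mean, assuming your llama-cpp-python build is recent enough to expose the flash_attn flag:

```python
from llama_cpp import Llama

# flash_attn=True enables llama.cpp's flash attention path, which reportedly
# works around the garbled output. The model path is again a placeholder.
llm = Llama(model_path="model.Q4_K_M.gguf", flash_attn=True)
```

In the WebUI, there should be a matching flash_attn checkbox in the llama.cpp loader settings.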