Thank you very much!
I just want to thank you, because I have been waiting for this type of model, and you are the first one to make it available here. This is the best model I have tried locally so far. Thank you!
You're welcome! Glad it's working well for you.
Also want to thank you for this!
Is it possible to run it on a GPU with HF?
Yes, it's possible to run it on the GPU.
I've done these repos:
4bit GPTQ quantisation: https://huggingface.co./TheBloke/alpaca-lora-65B-GPTQ-4bit
Full unquantised HF format: https://huggingface.co./TheBloke/alpaca-lora-65B-HF
The latter would need 128+ GB of VRAM, so it's not likely to be viable for most people. The GPTQ 4-bit files should hopefully run in 40GB of VRAM, e.g. 1 x A100 40GB or 2 x 24GB cards like 3090s or 4090s. I haven't actually tested them yet - I'm planning to do so soon - but they should work OK.
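If you do have the VRAM (or don't mind very slow CPU offload via accelerate), a minimal transformers sketch for the HF-format repo would look roughly like this - the prompt shown is just the standard Alpaca instruction template, so adjust as needed:

```python
# Minimal sketch: loading the unquantised HF repo with transformers + accelerate.
# Assumes roughly 130GB+ of combined VRAM for fp16, or CPU offload (much slower).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/alpaca-lora-65B-HF"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights are ~130GB for a 65B model
    device_map="auto",          # shard across available GPUs, offload the rest
)

# Standard Alpaca-style instruction prompt
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite a haiku about llamas.\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```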
Here's an explanation of the three different files in the GPTQ repo. I haven't had a chance to add this to the README yet:
alpaca-lora-65B-GPTQ-4bit-128g.safetensors: GPTQ 4-bit 128g with --act-order. Should be the highest possible quantisation quality. Requires recent GPTQ-for-LLaMA code; will not work with oobabooga's fork, and therefore won't work with the one-click installers for Windows.

alpaca-lora-65B-GPTQ-4bit-1024g.safetensors: Same as the above but with a groupsize of 1024. This possibly reduces quantisation quality slightly, but requires less VRAM. Created to make sure the file can load in 40GB VRAM on an A100 - it's possible the 128g file will need more than 40GB.

alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors: GPTQ 4-bit 128g without --act-order. Possibly slightly lower accuracy. Will work with oobabooga's GPTQ-for-LLaMA fork and the one-click installers.
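For anyone who'd rather load one of these files from Python instead of through a UI, a rough sketch with the AutoGPTQ library is below. I haven't verified this exact snippet; the arguments follow AutoGPTQ's documented API, and the model_basename is just the file name above without the extension:

```python
# Rough sketch: loading the 128g no-act-order file with AutoGPTQ on one GPU.
# The library calls follow AutoGPTQ's documented API but are untested here.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/alpaca-lora-65B-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename="alpaca-lora-65B-GPTQ-4bit-128g.no-act-order",
    use_safetensors=True,
    device="cuda:0",  # a single 40GB+ card; splitting across two 24GB cards needs a device map
)

prompt = "### Instruction:\nExplain GPTQ quantisation in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=80)[0]))
```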
This probably generates the most ChatGPT 3.5-like responses of any local setup I've tried. Pretty cool. It's slow, even on a computer that's "fast" by consumer standards, but I'd rather wait than get useless output.
Yep - it's the closest model to GPT 3.5 I've used... or even better than GPT 3.5, especially the q5_1 quant.
None of the ultra-mega-tuned 7B or 13B models come anywhere close to standard alpaca-lora 65B.
I've been testing its story-writing ability... so far it's actually better than GPT 3.5.
Coding also seems better...
I have a 65GB RAM system (i9-13900K) with a 4090 video card, but I guess I still need to use the CPU version. How do I most easily install this and get it running? The model page says something would have to be compiled? I currently have text-generation-webui set up and it works with smaller models... thanks.
You should be able to use llama.cpp models in text-generation-webui. Check out these docs: https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md
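If you'd rather skip the webui entirely, something like this llama-cpp-python sketch should also work - the model path is a placeholder for whichever quantised GGML file you downloaded:

```python
# Minimal sketch: running a quantised GGML file directly with llama-cpp-python.
# The filename below is a placeholder; point it at whichever file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/alpaca-lora-65B.ggml.q5_1.bin",  # placeholder path
    n_ctx=2048,      # context length
    n_threads=8,     # tune to your CPU core count
)

out = llm(
    "### Instruction:\nWrite a short story opening about a lighthouse keeper.\n\n### Response:\n",
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```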
Yep ..llama.cpp is great.
How would 65B-HF run on a Mac Studio with the M1 Max (10-core CPU, 24-core GPU, 16-core Neural Engine, 32GB unified memory)?