Meta AI's LLaMA, quantized to 4 bits with the GPTQ algorithm (v2). GPTQ implementation: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/49efe0b67db4b40eac2ae963819ebc055da64074
Conversion process
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors ./q4/llama7b-4bit-ts-ao-g128-v2.safetensors
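For intuition about what `--wbits 4` and `--groupsize 128` control, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest and is not GPTQ itself, which additionally adjusts the remaining weights to compensate for the rounding error. The function names `quantize_groups` and `dequantize_groups` are hypothetical and do not come from the GPTQ-for-LLaMa repository.

```python
import math


def quantize_groups(weights, wbits=4, groupsize=128):
    """Round-to-nearest quantization with one (scale, zero) pair per group.

    Illustrative only: this is NOT the GPTQ algorithm, just the storage
    scheme that a bit width and a group size imply.
    """
    qmax = 2 ** wbits - 1  # 15 for 4-bit codes
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        # Make sure the representable range always contains 0.0.
        lo = min(min(group), 0.0)
        hi = max(max(group), 0.0)
        scale = (hi - lo) / qmax
        if scale == 0.0:  # all-zero group, any scale works
            scale = 1.0
        zero = round(-lo / scale)  # integer code that maps back to 0.0
        for w in group:
            q = round(w / scale) + zero
            codes.append(max(0, min(qmax, q)))  # clamp to [0, qmax]
        scales.append(scale)
        zeros.append(zero)
    return codes, scales, zeros


def dequantize_groups(codes, scales, zeros, groupsize=128):
    """Map integer codes back to approximate float weights."""
    return [
        scales[i // groupsize] * (q - zeros[i // groupsize])
        for i, q in enumerate(codes)
    ]


# Toy data standing in for one row of a weight matrix (assumption).
weights = [math.sin(i) for i in range(256)]
codes, scales, zeros = quantize_groups(weights)
restored = dequantize_groups(codes, scales, zeros)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(len(scales), max_err)
```

A smaller group size stores more scale and zero-point pairs, which tends to reduce the rounding error at the cost of a slightly larger file.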