---
license: apache-2.0
---

Experimental quants of 4-expert MoE Mixtrals in various GGUF formats.

The goal is to have the best-performing MoE under 10 GB.

Experimental q8 and q4 files are also included for training/finetuning.

- No sparsity tricks yet.

The 8.4 GB custom 2-bit quant works OK up to a 512-token context length, then starts looping.

Install llama.cpp from GitHub and run the server:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
wget https://huggingface.co./nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf
./server -m 4mixq2.gguf --host "ec2-3-99-206-122.ca-central-1.compute.amazonaws.com" -c 512
```

Limit output to 500 tokens per request (the server accepts an `n_predict` field for this).
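For reference, a minimal request against the server's `/completion` endpoint that caps generation at 500 tokens might look like the sketch below. It assumes the default port 8080 (no `--port` was passed above) and reuses the example EC2 hostname; swap in your own host and prompt.

```bash
# Query the running llama.cpp server, capping generation at 500 tokens
# via n_predict. Hostname and prompt are placeholders for illustration.
curl http://ec2-3-99-206-122.ca-central-1.compute.amazonaws.com:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 500
      }'
```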