nisten/quad-mixtrals-gguf

nisten committed on Dec 23, 2023

Commit 3a5fe36 • 1 Parent(s): 8b281e3

Update README.md

Files changed (1)

README.md +24 -3

@@ -2,8 +2,29 @@
 license: apache-2.0
 ---
-Experimental quants of 4 headed mixtrals in various GGUF formats.
-Goal is to have the best performing MoE < 16 Gig .
-They still need training/finetuning
+Experimental quants of 4-expert MoE Mixtrals in various GGUF formats.
+Goal is to have the best-performing MoE < 10 GB.
+They still need training/finetuning.
+No sparsity tricks yet.
+The 8.4 GB custom 2-bit quant works OK up to a 512-token length, then starts looping.
+Install llama.cpp from GitHub:
+```
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+make -j
+wget https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf
+./server -m 4mixq2.gguf --host "ec2-3-99-206-122.ca-central-1.compute.amazonaws.com" -c 512
+```
+Limit output to 500 tokens.
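
The README does not say how to apply the 500-token limit. One way, sketched here and not taken from the commit, is to set `n_predict` in the request body sent to the llama.cpp server's `/completion` endpoint. The URL below assumes the server is reachable on its default address and port (`127.0.0.1:8080`); `build_payload` and `complete` are hypothetical helper names introduced for illustration.

```python
import json
import urllib.request

# Assumption: server started with default host/port; change this to match
# whatever --host/--port you actually passed to ./server.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_payload(prompt: str, max_tokens: int = 500) -> dict:
    """Build a /completion request body that caps generation at max_tokens."""
    return {"prompt": prompt, "n_predict": max_tokens}


def complete(prompt: str, max_tokens: int = 500, url: str = SERVER_URL) -> str:
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The server replies with JSON; the generated text is under "content".
        return json.loads(response.read())["content"]
```

With the server running, `complete("Explain what a mixture-of-experts model is.")` would return at most 500 generated tokens. Since the server is started with `-c 512`, the prompt and the generated tokens have to share that context window, so short prompts are the safer choice.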