nisten/quad-mixtrals-gguf

nisten committed on Dec 23, 2023

Commit cdd8bfb (1 parent: 3a5fe36)

Update README.md

Files changed (1): README.md +10 -9

@@ -2,29 +2,30 @@
 license: apache-2.0
 ---
-Experimental quants of 4 expert MoE mixtrals in various GGUF formats.
-Goal is to have the best performing MoE < 10gb .
+#Experimental quants of 4 expert MoE mixtrals in various GGUF formats.
+##Goal is to have the best performing MoE < 10gb .
 They still need training/finetuning.
-No sparsity tricks yet.
+####No sparsity tricks yet.
 8.4gb custom 2bit quant works ok up until 512 token length then starts looping.
 Install llama.cpp from github
-```
-git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
-make -j
-wget https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf
-./server -m 4mixq2.gguf --host "ec2-3-99-206-122.ca-central-1.compute.amazonaws.com" -c 512```
+```bash
+`git clone https://github.com/ggerganov/llama.cpp`
+`cd llama.cpp`
+`make -j `
+`wget https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf`
+`./server -m 4mixq2.gguf --host "ec2-3-99-206-122.ca-central-1.compute.amazonaws.com" -c 512`
+```
 limit output to 500 tokens
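The README steps in this commit can be collected into one runnable script. This is a minimal sketch, not the author's setup: it assumes a llama.cpp checkout from around the time of this commit, when `make -j` still built a `./server` binary (later releases build with CMake and name the binary `llama-server`). The commands are left unwrapped because, inside a `bash` fence, backticks mean command substitution, so `` `cd llama.cpp` `` would run in a subshell and leave the working directory unchanged for `make`. The author's EC2 hostname is replaced by a `HOST` variable defaulting to `127.0.0.1`; that default, port 8080, the file name `run.sh`, and the helper names `setup`, `serve`, `completion_body` and `query` are assumptions for illustration. The 500-token output limit is applied per request through the server's `n_predict` field on `/completion`.

```bash
#!/usr/bin/env bash
# Sketch of the README steps from this commit. Assumes a late-2023 llama.cpp
# checkout where `make` builds ./server (newer releases: CMake + llama-server).
# Written in POSIX sh syntax so it also runs under plain `sh`.
set -eu

MODEL_URL="https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf"
HOST="${HOST:-127.0.0.1}"  # assumption: bind locally instead of the author's EC2 hostname
PORT="${PORT:-8080}"       # assumption: llama.cpp server's default port
CTX=512                    # README: the 2-bit quant starts looping past ~512 tokens
MAX_TOKENS=500             # README: limit output to 500 tokens

# Clone and build llama.cpp, then download the 2-bit quant into the checkout.
setup() {
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make -j
  wget "$MODEL_URL"
}

# Start the HTTP server with a 512-token context window.
serve() {
  cd llama.cpp
  ./server -m 4mixq2.gguf --host "$HOST" --port "$PORT" -c "$CTX"
}

# Build the JSON body for POST /completion, capping output via n_predict.
# Note: the prompt is NOT JSON-escaped; avoid double quotes and backslashes.
completion_body() {
  printf '{"prompt": "%s", "n_predict": %d}' "$1" "$MAX_TOKENS"
}

# Send one prompt to the running server and print the raw JSON reply.
query() {
  curl -s "http://$HOST:$PORT/completion" \
    -H 'Content-Type: application/json' \
    -d "$(completion_body "$1")"
}

# Dispatch: `bash run.sh setup`, `bash run.sh serve`, `bash run.sh query "Hello"`.
# With no arguments the script only defines the helpers above.
if [ "$#" -gt 0 ]; then
  "$@"
fi
```

Run `bash run.sh setup` once, then `bash run.sh serve`, and from a second shell `bash run.sh query "Hello"`. Capping output per request keeps the server flags identical to the README's command; for prompts that contain quotes, build the JSON with a proper encoder such as `jq` instead of `printf`.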