---
license: apache-2.0
---

**Experimental quants of 4-expert Mixtral MoE models in various GGUF formats.**

**The goal is the best-performing MoE under 10 GB.**

Experimental q8 and q4 files are included for training/finetuning too.

***No sparsity tricks yet.***

The 8.4 GB custom 2-bit quant works fine up to a 512-token context length, then starts looping.

- Install llama.cpp from GitHub, download the 2-bit quant, and run the server:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Download the 8.4 GB custom 2-bit quant
wget https://huggingface.co./nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf

# Serve it with a 512-token context (replace the host with your own address)
./server -m 4mixq2.gguf --host "my.internal.ip.or.my.cloud.host.name.goes.here.com" -c 512
```

Limit output to 500 tokens per response, since the 2-bit quant starts looping past 512.
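
As a sketch of how to enforce that cap, the llama.cpp server's `/completion` endpoint accepts an `n_predict` parameter; the port (8080, the server default) and the prompt below are illustrative assumptions.

```bash
# Hypothetical request: query the running server and cap generation at 500 tokens
curl "http://my.internal.ip.or.my.cloud.host.name.goes.here.com:8080/completion" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain mixture-of-experts models in one paragraph.", "n_predict": 500}'
```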