---
license: apache-2.0
---

**Experimental quants of 4-expert Mixtral MoE models in various GGUF formats.**

**The goal is the best-performing MoE under 10 GB.**

Experimental q8 and q4 files are included for training/finetuning too.

***No sparsity tricks yet.***

The 8.4 GB custom 2-bit quant works fine up to a 512-token context length, then starts looping.

- Install llama.cpp from GitHub, download the 2-bit quant, and run the server:

```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Download the 8.4 GB custom 2-bit quant
wget https://huggingface.co./nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf

# Serve it with a 512-token context (replace the host with your own address)
./server -m 4mixq2.gguf --host "my.internal.ip.or.my.cloud.host.name.goes.here.com" -c 512
```

Limit output to 500 tokens per response, since the 2-bit quant starts looping past 512.
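
As a sketch of how to enforce that cap, the llama.cpp server's `/completion` endpoint accepts an `n_predict` parameter; the port (8080, the server default) and the prompt below are illustrative assumptions.

```bash
# Hypothetical request: query the running server and cap generation at 500 tokens
curl "http://my.internal.ip.or.my.cloud.host.name.goes.here.com:8080/completion" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain mixture-of-experts models in one paragraph.", "n_predict": 500}'
```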