---
license: apache-2.0
---

**Experimental quants of 4-expert MoE Mixtral models in various GGUF formats.**

**The goal is to have the best-performing MoE under 10 GB.**

Experimental q8 and q4 files are also included for training/finetuning.

***No sparsity tricks yet.***

The 8.4 GB custom 2-bit quant works fine up to a 512-token context, then starts looping.

- Install llama.cpp from GitHub and run the server:


```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Download the custom 2-bit quant (~8.4 GB)
wget https://huggingface.co./nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf

# Serve it with a 512-token context
./server -m 4mixq2.gguf --host "my.internal.ip.or.my.cloud.host.name.goes.here.com" -c 512
```


Limit output to 500 tokens to stay under the ~512-token point where the 2-bit quant starts looping.
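
For example, you can cap generation when querying the llama.cpp server's `/completion` endpoint via `n_predict`. A minimal sketch, assuming the placeholder hostname from the command above and the server's default port 8080:

```bash
# Ask the running server for a completion, capped at 500 new tokens
# (replace the host with whatever you passed to --host above)
curl --request POST \
  --url http://my.internal.ip.or.my.cloud.host.name.goes.here.com:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 500}'
```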