Update README.md
README.md
# Mini-FAQ

## I miss model XXX

I am not the only one who makes quants. For example, Lewdiculous makes high-quality imatrix quants of many
small models *and has a great presentation*. For small models (< 30B) I either don't bother with imatrix quants,
or skip them because I saw that others already did them, to avoid duplicating work.

Other notable people who do quants are Nexesenex, bartowski, dranger003 and Artefact2. I'm not saying
anything about the quality, because I have probably forgotten some really good folks in this list, and I
wouldn't know anyway. Model creators also often provide their own quants. I sometimes skip models because of
that, even if the creator provides far fewer quants than I would.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version
for models where I didn't provide one.

## I miss quant type XXX

The quant types I currently do regularly are:

- static: Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S

They are generally (but not always) generated in the order above, for which there are deep reasons.
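
For reference, the mechanics behind the two families look roughly like this with llama.cpp's tools. This is a minimal sketch, not my exact pipeline: file names are placeholders, and the exact script and binary names have changed between llama.cpp versions.

```sh
# convert the original model to a full-precision GGUF first (f16 here)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf

# static quant: just requantize the f16 GGUF
./llama-quantize model-f16.gguf model-Q4_K_S.gguf Q4_K_S

# imatrix quant: first compute an importance matrix over the training text,
# then hand it to the quantizer
./llama-imatrix -m model-f16.gguf -f imatrix-training.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-i1-Q4_K_S.gguf Q4_K_S
```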

For models of roughly less than 10B parameters, I also experimentally generate f16 versions at the moment. Or plan to; it's a bit hacky.

Older models that pre-date the introduction of new quant types will generally have them retrofitted, hopefully
this year - at least when multiple quant types are missing, as it is hard to justify a big model download
for just one quant. If you want a quant from the above list and don't want to wait, feel free to request it and I will
prioritize it to the best of my abilities.

I specifically do not do Q2_K_S, because I generally think it is not worth it, and IQ4_NL, because it requires
a lot of computation and is generally completely superseded by IQ4_XS.

You can always try to change my mind.

## What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
fares: create an imatrix from a bad quant (e.g. static Q2_K), then use the resulting imatrix quant to generate a
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better
than changing it. If I make considerable changes to how I create imatrix data, I will probably bump it to `-i2` and so on.
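
Purely for illustration, the iteration I had in mind would have looked something like this. It is hypothetical - as said, it was never actually done - and file names are placeholders:

```sh
# round 0: compute the imatrix from a cheap static quant instead of the f16 model
./llama-quantize model-f16.gguf model-Q2_K.gguf Q2_K
./llama-imatrix -m model-Q2_K.gguf -f imatrix-training.txt -o round1.imatrix

# round 1: quantize with that imatrix, then recompute the imatrix from the
# (hopefully better) result - and so on, until it stops improving
./llama-quantize --imatrix round1.imatrix model-f16.gguf model-round1.gguf Q4_K_S
./llama-imatrix -m model-round1.gguf -f imatrix-training.txt -o round2.imatrix
```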

Since there is some subjectivity/choice in imatrix training data, this also distinguishes my quants from
quants by other people who made different choices.

## What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens; about half of it is semi-random tokens (sentence fragments)
taken from stories, and the other half is kalomaze's groups_merged.txt and a few other things. I have a half
set and a quarter set for models that are too big or too stubborn.

Neither my set nor kalomaze's data contains large amounts of non-English training data, which is why I tend not
to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing
English over other languages. But from (sparse) testing data it looks as if this doesn't actually make a big
difference. More data is always welcome.

Unfortunately, I do not have the rights to publish the training data, but I might be able to replicate an
equivalent set in the future and publish that.

## Why are you doing this?

Because at some point I found that some new, interesting models weren't available as GGUF anymore - my go-to
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided that uploading would be a
good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make
(mostly in free software), so it felt natural to contribute, even at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing.
This increased the time required to make such quants by an order of magnitude - and also the management overhead.

Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
would not have started doing this a month later, as I would have been daunted by the complexity and work required.

## You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than them to create my quants. I currently
have access to eight servers that have good upload speed. Five of them are Xeon quad-core class machines from ~2013,
three are Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB of DDR5 RAM, and
originally had an RTX 4070 (now, again, upgraded to a 4090 thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
uploads are small.

## How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XXS or even Q2_K), reducing the size of the model.
2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
   then stream the remaining data from disk for every iteration (see the sketch below).
3. Patience.
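
A rough sketch of what point 2 can look like in practice, assuming vmtouch is used for the page-cache pinning - that is one way to do it, not necessarily mine; sizes, file names and binary names are placeholders:

```sh
# pin the first ~80GB of the (possibly pre-quantized) GGUF in the page cache;
# needs a raised memlock limit or root, and spell the range in plain bytes
# if your vmtouch build doesn't accept size suffixes
vmtouch -l -p 0-80G big-model-Q8_0.gguf &

# compute the imatrix as usual; llama.cpp mmaps the file, so the locked part
# stays resident in RAM while the remainder streams from NVMe on every pass
./llama-imatrix -m big-model-Q8_0.gguf -f imatrix-training.txt -o big-model.imatrix
```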

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
generate imatrix data for most models in fp16, 70B models in Q8_0, and almost everything else in Q4_K_S.

## Why don't you use gguf-split?