Update README.md
README.md
# Mini-FAQ

## I miss model XXX

I am not the only one who makes quants. For example, Lewdiculous makes high-quality imatrix quants of many
small models *and has a great presentation*. For small models (< 30B) I either don't bother with imatrix quants,
or skip them because I saw that others already did them, to avoid duplicating work.

Other notable people who do quants are Nexesenex, bartowski, dranger003 and Artefact2. I'm not saying
anything about the quality, because I have probably forgotten some really good folks in this list, and I
wouldn't know anyway. Model creators also often provide their own quants. I sometimes skip models because of
that, even if the creator provides far fewer quants than I would.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version
for models where I didn't provide one.

## I miss quant type XXX

The quant types I currently do regularly are:

- static: Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S

They are generally (but not always) generated in the order above, for which there are deep reasons.
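
For reference, the mechanics behind the two families look roughly like this with llama.cpp's tools. This is a minimal sketch, not my exact pipeline: file names are placeholders, and the exact script and binary names have changed between llama.cpp versions.

```sh
# convert the original model to a full-precision GGUF first (f16 here)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf

# static quant: just requantize the f16 GGUF
./llama-quantize model-f16.gguf model-Q4_K_S.gguf Q4_K_S

# imatrix quant: first compute an importance matrix over the training text,
# then hand it to the quantizer
./llama-imatrix -m model-f16.gguf -f imatrix-training.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-i1-Q4_K_S.gguf Q4_K_S
```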

For models of roughly less than 10B parameters, I also experimentally generate f16 versions at the moment. Or plan to; it's a bit hacky.

Older models that pre-date the introduction of new quant types will generally have them retrofitted, hopefully
this year - at least when multiple quant types are missing, as it is hard to justify a big model download
for just one quant. If you want a quant from the above list and don't want to wait, feel free to request it and I will
prioritize it to the best of my abilities.

I specifically do not do Q2_K_S, because I generally think it is not worth it, and IQ4_NL, because it requires
a lot of computation and is generally completely superseded by IQ4_XS.

You can always try to change my mind.

## What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
fares: create an imatrix from a bad quant (e.g. static Q2_K), then use the resulting imatrix quant to generate a
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better
than changing it. If I make considerable changes to how I create imatrix data, I will probably bump it to `-i2` and so on.
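
Purely for illustration, the iteration I had in mind would have looked something like this. It is hypothetical - as said, it was never actually done - and file names are placeholders:

```sh
# round 0: compute the imatrix from a cheap static quant instead of the f16 model
./llama-quantize model-f16.gguf model-Q2_K.gguf Q2_K
./llama-imatrix -m model-Q2_K.gguf -f imatrix-training.txt -o round1.imatrix

# round 1: quantize with that imatrix, then recompute the imatrix from the
# (hopefully better) result - and so on, until it stops improving
./llama-quantize --imatrix round1.imatrix model-f16.gguf model-round1.gguf Q4_K_S
./llama-imatrix -m model-round1.gguf -f imatrix-training.txt -o round2.imatrix
```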

Since there is some subjectivity/choice in imatrix training data, this also distinguishes my quants from
quants by other people who made different choices.

## What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens; about half of it is semi-random tokens (sentence fragments)
taken from stories, and the other half is kalomaze's groups_merged.txt and a few other things. I have a half
set and a quarter set for models that are too big or too stubborn.

Neither my set nor kalomaze's data contains large amounts of non-English training data, which is why I tend not
to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing
English over other languages. But from (sparse) testing data it looks as if this doesn't actually make a big
difference. More data is always welcome.

Unfortunately, I do not have the rights to publish the training data, but I might be able to replicate an
equivalent set in the future and publish that.

## Why are you doing this?

Because at some point I found that some new, interesting models weren't available as GGUF anymore - my go-to
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided that uploading would be a
good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make
(mostly in free software), so it felt natural to contribute, even at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing.
This increased the time required to make such quants by an order of magnitude - and also the management overhead.

Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
would not have started doing this a month later, as I would have been daunted by the complexity and work required.

## You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than them to create my quants. I currently
have access to eight servers that have good upload speed. Five of them are Xeon quad-core class machines from ~2013,
three are Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB of DDR5 RAM, and
originally had an RTX 4070 (now, again, upgraded to a 4090 thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
uploads are small.

## How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XXS or even Q2_K), reducing the size of the model.
2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
   then stream the remaining data from disk for every iteration (see the sketch below).
3. Patience.
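
A rough sketch of what point 2 can look like in practice, assuming vmtouch is used for the page-cache pinning - that is one way to do it, not necessarily mine; sizes, file names and binary names are placeholders:

```sh
# pin the first ~80GB of the (possibly pre-quantized) GGUF in the page cache;
# needs a raised memlock limit or root, and spell the range in plain bytes
# if your vmtouch build doesn't accept size suffixes
vmtouch -l -p 0-80G big-model-Q8_0.gguf &

# compute the imatrix as usual; llama.cpp mmaps the file, so the locked part
# stays resident in RAM while the remainder streams from NVMe on every pass
./llama-imatrix -m big-model-Q8_0.gguf -f imatrix-training.txt -o big-model.imatrix
```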

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
generate imatrix data for most models in fp16, 70B models in Q8_0, and almost everything else in Q4_K_S.

## Why don't you use gguf-split?