mradermacher and treehugg3 committed
Commit 908b515 · verified · 1 Parent(s): 193aef9

Add imatrix computing tips (#746)


- Add imatrix computing tips (0428d78df129e05b401baf36b98bf052f6b53c4d)
- Update with imatrix training dataset info (8507cc0203378620d46fbcc92af90ab225d7994b)
- Add link to llama-imatrix (1efd43b211b4a5f5d529f88a1a63b80720dc85f6)


Co-authored-by: Steven Goldfeather <[email protected]>

Files changed (1): README.md +29 -0
README.md CHANGED
@@ -142,6 +142,35 @@ and then run another command which handles download/computation/upload. Most of
  to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
  is unfortunately very frequent).
 
+ ## What do I need to do to compute imatrix files for large models?
+
+ Use [`llama-imatrix`](https://github.com/ggml-org/llama.cpp/blob/master/examples/imatrix/README.md) to compute imatrix files.
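+
+ A minimal invocation sketch (file names are placeholders, and the flags are taken from the llama.cpp imatrix README, so double-check them against your build):
+
+ ```bash
+ # Compute an imatrix from a plain-text calibration dataset.
+ # -m: source model GGUF (Q8_0 is sufficient, see the tips below)
+ # -f: imatrix training dataset
+ # -o: output imatrix file
+ # -ngl: number of layers to offload to the GPU
+ ./llama-imatrix -m Model-405B-Q8_0.gguf -f imatrix-train.txt \
+     -o Model-405B.imatrix -ngl 10
+ ```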
+
+ ### Hardware
+
+ * RAM: A lot of RAM is required to compute imatrix files. For example, 512 GB is just enough to compute the imatrix for a 405B model in Q8.
+ * GPU: At least 8 GB of GPU memory.
+
+ ### Dataset
+
+ * You want to create a dataset around double the size of bartowski1182's imatrix dataset. Quality is far more important
+ than size. If you don't mind long training times, you can make it massive, but beyond 1 MB there will
+ probably be diminishing returns.
+ * Your imatrix dataset should contain the typical output the model would generate for the workload you plan to use
+ the model for. If you plan to use the model as a programming assistant, it should contain the typical code
+ you would ask it to write. The same applies to language: our dataset is mostly English, so if you use our imatrix models in
+ a different language, they will likely perform worse than static quants, as only a very small portion of our imatrix training data
+ is multilingual. We only have the resources to generate a single generic imatrix per model, so our imatrix dataset must contain examples
+ of every common use case of an LLM (a sketch of assembling such a file follows this list).
+
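+ A sketch of assembling a generic calibration file from per-domain samples (the file names are hypothetical; the mix should reflect your intended workloads):
+
+ ```bash
+ # Concatenate per-domain samples into one imatrix training file.
+ cat samples/english-prose.txt samples/code.txt samples/math.txt \
+     samples/multilingual.txt > imatrix-train.txt
+ # Check the size; beyond ~1 MB there are probably diminishing returns.
+ wc -c imatrix-train.txt
+ ```
+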
+ ### Extra tips
+
+ * Computing the 405B imatrix in Q8 does not seem to have any noticeable quality impact compared to BF16, so to save on hardware
+ requirements, use Q8 (see the quantization sketch after this list).
+ * Sometimes a single node does not have enough RAM to compute the imatrix file. In such cases, the RPC backend inside llama.cpp can
+ be used to combine the RAM/VRAM of multiple nodes (see the sketch after this list). This approach takes longer: computing the 405B imatrix in BF16 takes
+ around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for Q8 on a single node.
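+
+ A sketch of producing the Q8 source model first, assuming `llama-quantize` from llama.cpp and placeholder file names:
+
+ ```bash
+ # Quantize the BF16 model to Q8_0 before computing the imatrix;
+ # per the tip above, the quality impact appears negligible.
+ ./llama-quantize Model-405B-BF16.gguf Model-405B-Q8_0.gguf Q8_0
+ ```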
+
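+ A sketch of spreading the computation across nodes, assuming llama.cpp's RPC backend (`rpc-server` plus the `--rpc` flag, both of which require a build with RPC enabled); hosts, ports, and file names are placeholders:
+
+ ```bash
+ # On each worker node, expose its RAM/VRAM over RPC:
+ ./rpc-server -p 50052
+
+ # On the main node, point llama-imatrix at the workers:
+ ./llama-imatrix -m Model-405B-BF16.gguf -f imatrix-train.txt \
+     -o Model-405B.imatrix --rpc 192.168.0.2:50052,192.168.0.3:50052
+ ```
+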
  ## Why don't you use gguf-split?
 
  TL;DR: I don't have the hardware/resources for that.