Quantizing snowflake-arctic-instruct: imatrix coverage on a 128-expert MoE, RAM juggling on nico1, and HuggingFace limits
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I assume that is in GB and not GiB, in which case the 474 GiB might fit, as we have 503 GiB of RAM (after subtracting RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
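For reference, the two units differ by about 7%; a quick sanity check of the numbers above (plain arithmetic, shown as Python, assuming the 509G in the listing is decimal GB as discussed):

size_gb = 509                       # listing value, assumed to be decimal GB
size_gib = size_gb * 1e9 / 2**30    # convert to binary GiB, the unit free/top use
print(round(size_gib))              # -> 474, against 503 GiB of usable RAM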
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
Q6_K is fine for me. Q8_0 might not fit without offloading and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally you should give it a try and see if it fits, but if not, Q6_K is fine for me.
I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf
so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.
474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I'll try an offload of 1 and 0, then Q6. hopefully it does not crash.
I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
457.4g after warming up.
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)
llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1
and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:
Yes it is clearly streaming from SSD now:
Once the quantisation tasks are interrupted it should work without SSD streaming again.
This is somewhat worrying:
[1]2.9360,[2]2.3937,[3]2.4731,[4]2.5391,[5]2.8621,[6]2.8125,[7]2.6349,[8]2.9891,[9]2.8659,
save_imatrix: entry ' blk.34.ffn_up_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.34.ffn_down_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.34.ffn_gate_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.1.ffn_up_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_down_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_gate_exps.weight' has partial data (94.53%) - skipping
save_imatrix: storing only 373 out of 385 entries
Yes, one iteration after both quant tasks finished it stopped streaming. But these are big tasks.
Nope, started again.
As for the quantize tasks, I don't know what is going on. I was also able to see this, but now I am unable to see any processes.
I think it stopped streaming for good. It is possible that it also takes a few iterations for everything to stay in memory.
Top now at 461.3g (495GB). So it isn't tight. Let's see what happens.
This is somewhat worrying:
It should be fine, and maybe expected for a MoE model with 128 experts. According to the llama.cpp source code (https://github.com/ggerganov/llama.cpp/blob/d9c3ba2b7749c00df477599aa141a98b4521aa2c/examples/imatrix/imatrix.cpp#L218-L219 ) this warning is part of the code that avoids writing imatrix entries that do not have full data, which can happen with MoE models where some of the experts end up not being exercised by the provided training data.
Storing 373 out of 385 entries seems to be good enough.
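The logic behind those messages is easy to restate outside llama.cpp. The following is only a simplified Python paraphrase of what the linked save_imatrix code does; the per-slot call-count bookkeeping is my assumption about the internal layout, not the actual C++:

def entries_to_store(stats):
    # stats: {tensor_name: list of per-slot call counts}; a slot that was never
    # hit (count 0) corresponds to e.g. an expert the calibration data never
    # routed to, so the whole entry is skipped instead of being saved with holes.
    kept = {}
    for name, counts in stats.items():
        covered = sum(1 for c in counts if c > 0)
        if covered < len(counts):
            pct = 100.0 * covered / len(counts)
            print(f"entry '{name}' has partial data ({pct:.2f}%) - skipping")
            continue
        kept[name] = counts
    return kept

With 128 slots along the expert dimension and exactly one slot never hit, this prints the 99.22% seen above.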
It's reducing. These look like useful new messages, actually.
[10]3.1400,[11]3.2586,[12]3.0453,[13]3.0821,[14]3.3073,[15]3.5876,[16]3.7071,[17]3.9026,[18]4.0482,[19]4.1979,
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (89.06%) - skipping
save_imatrix: storing only 379 out of 385 entries
It's reducing. These look like useful new messages, actually.
This is expected as the longer we train the more likely experts are to be included during imatrix training. I'm wondering if MoE models need longer imatrix training compared to monolithic models. This one has 128 experts while only 2 are active for a given token so we only use 1/64th of the model for every token.
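A back-of-the-envelope check of that 1/64 intuition, under the simplifying (and in practice certainly wrong) assumption that the router picks its top-2 experts uniformly at random:

n_experts, active = 128, 2

def p_expert_unseen(n_tokens):
    # chance a given expert is never among the top-2 for any of n_tokens tokens
    return (1 - active / n_experts) ** n_tokens

for n_tokens in (1_000, 10_000, 100_000):
    print(n_tokens, n_experts * p_expert_unseen(n_tokens))  # expected number of missing experts

Under uniform routing every expert would be hit after a few thousand tokens, so a still-missing expert points at heavily skewed routing or training data rather than at too little data, which matches what the dataset experiments later in this thread suggest.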
If it stays that way, we'll have good chances that the imatrix quantization will fail (if the message means what I think it does). If true, it intuitively makes sense - it's harder to tickle all experts in such a massive MoE model. Well, we have another 330 chunks.
I'm wondering if MoE models need longer imatrix training
Longer is unlikely to help - the right training data is more likely to. The top two (with 99.22%) have not reduced in the last iterations. And good that I save every 10 iterations, I knew someday it would be useful for something :)
Pretty exciting. Anyway, over and out for a while.
What is interesting is that it doesn't show a message for every tensor it skips. And it really is quite fast - obvious in hindsight. But I don't think the remaining chunks will do anything. Let's see if it quants. My prediction would be that it will likely fail with low bit quants.
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea, as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
Unfortunately some quantisation tasks have now started to run as well:
1 66 I huihui-ai-abliterated-Qwen2.5-32B-Inst-BaseMerge-TIES run/imatrix 9/25,IQ4_XS [705/771] (hfu i1-Q6_K)
Not sure what I should do to pause the quantization tasks. I could pause the entire host, but that seems a bit overkill and might cause other issues.
If it stays that way, we'll have good chances that the imatrix quantization will fail
I don't think it will fail. It will hopefully just statically quant blk.0.ffn_down_exps.weight, blk.0.ffn_gate_exps.weight and blk.0.ffn_up_exps.weight which should be fine as then the vast majority of the model will have the imatrix applied and it seems unlikely there would be any meaningful real world quality difference. The question is more if llama.cpp is capable of quantizing with a partial imatrix. I don’t think this was ever tested.
The top two (with 99.22%) have not reduced in the last iterations.
(100/128)*127 = 99.21875% => 99.22%
So we are just missing a single expert on a single layer. For some reason none of our training data seems to get routed to this specific expert for the first layer. All other layers already reached full coverage.
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
As a side note, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
I would assume a very specific expert. I couldn't even come up with 128 different types of experts, so I expect some of them to have really specific areas of activation.
As a side note, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
We would ideally prevent the scheduler from starting any tasks while the imatrix of such massive models is being computed. It is not that bad if this happens while running them normally, as it will just start streaming from SSD, essentially pausing it until there is enough RAM, but with RPC, running out of RAM will result in a total system crash. I likely should have just let it stream from SSD until you had time to fix it, but I know that the /tmp/pause flag only makes new imatrix tasks wait in an endless loop, which, unlike pausing the entire host, should be safe.
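For context, this is how I understand the flag to be honoured - an assumption about the scheduler, not its actual code: a task that has not started yet just sleeps in a loop until the flag is gone, while everything already running is untouched. Roughly:

import os, time

PAUSE_FLAG = "/tmp/pause"  # path as used in this thread

def wait_while_paused(poll_seconds=60):
    # only reached by tasks that have not started yet; removing the file
    # (or never creating it) lets them proceed immediately
    while os.path.exists(PAUSE_FLAG):
        time.sleep(poll_seconds)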
While we are at the topic of pausing: the performance measurement project is coming along extremely well, so soon I will have to pause the entire nico1 host for multiple nights if we want to do the performance measurements on StormPeak. I hope this is not too disruptive, or I might not do it. I'm currently doing performance measurements on Threadripper, CastlePeak, Raspberry Pi 4 and the 7840S laptop, and all of them should be done within the next few days. I will try to keep the StormPeak measurements to an absolute minimum and only measure with 32 threads, which based on my current results should be the setting that gives the best performance on a 32-core/64-thread CPU.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Awesome, and I see that snowflake imatrix quantizing seems to work! Thanks a lot for doing imatrix quants of this amazing model. If the imatrix quants turn out well we can do the snowflake base model too. I will give them a try tomorrow.
I likely should have just let it stream from
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag (which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem).
I will have to pause the entire nico1 host for multiple nights
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
It's up to you, though, and I can try to cope, but it would add a level of manual managing that I could avoid at this point :)
Normally it is not an issue to pause, especially for a night. It is always an issue when the system is in an exceptional state though, e.g. when doing big models (which requires some intervention due to dependencies the system cannot see) or shortly before switching off nodes.
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag
What do you mean? I did communicate everything a few messages ago, as you can see under https://huggingface.co./mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6754b97f1bc6b93608c48774 or in the following quote:
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea, as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
I did describe exactly what I did and why I did so.
which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem
No, it did start imatrix computation for Gugugo-koen-7B-V1.1 while the snowflake-arctic-instruct imatrix computation was still running (as can be seen in the status page snippet posted above) and later even tried to start another one, which got luckily paused by the /tmp/pause flag. Please check your logs why this happened.
Yes, the quantization tasks were an issue as well, but they are not as bad as parallel imatrix tasks. Quantization tasks will eventually finish and free up enough RAM for imatrix tasks to no longer stream from SSD, while if two imatrix tasks start streaming from SSD none of them will ever finish. We were lucky it was only a 7B model and so fully offloaded to GPU. What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started, and it just happened not to allocate memory on the GPU used by snowflake-arctic-instruct. If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur.
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
No hurry then; I will wait for dbX/backup1 to be gone. I already have really good performance measurements, so I can start analyzing them even without waiting for data from StormPeak, or use this time to measure some other devices like my phone.
I did describe exactly what I did and why I did so.
You are right, somehow I didn't see that message, and you acted well. Sorry for doubting you.
Please check your logs why this happened.
It happened because I wanted to see the effect of it - since that model would fit completely into the vram, it should have worked, after a small disruption due to loading the model. Either that, or I would have gained understanding. I was still there when it happened, and even if it weren't planned, I would have cleaned up. The same happened when the quant jobs were started at 22:00, which was not planned :)
There was plenty of RAM available - it might still have started streaming due to bad memory management in Linux, but that is another story.
I also don't think (and don't see) how it would have started a third imatrix job, as so far it has never tried to start three jobs, simply because it would not have the budget and gpu available. It did start a second one after snowflake was done, though.
We were lucky it was only a 7B model
It wasn't luck, there simply was no budget for (much) more.
What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started
It was only running on one gpu - I changed the job to reflect that (the status display never reflected that because it was originally overwritten).
If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur
Right, and as far as I can see, that rule was never violated.
I will wait for dbX/backup1 to be gone.
Thanks, that helps a lot.
I don't think it will fail. [missing imatrix data for a tensor]
Llama.cpp commonly fails to quantize MoEs for this reason (I have lots of models where I don't have imatrix quants for that reason). I do not know if this message correlates perfectly with that (the message is new), but llama.cpp does not quantize tensors it has no imatrix data for - it's the same message you get when trying to do low-bpw quants without an imatrix. It predominantly happens on "less important" models, so I usually do not make a fuss of it and simply skip the model, or in some cases the imatrix quants.
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
There goes my IQ1 of snowflake :/
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I'm generating the remaining quants. I see only two options: a) find training data that exercises that expert b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
There goes my IQ1 of snowflake :/
It indeed fails, but only for very low bit-per-weight quants. This is because, as expected, it statically quants the layers containing missing experts, which in this case is layer 0. There is a check in llama.cpp that stops the quantization process if one tries to statically quant at too low a bit-per-weight, as this usually results in an unusable model. You are right: if there is still partial data at the end of imatrix training, imatrix quantization will fail for all low bit-per-weight quants. All other imatrix quants will work without any issues and without any real-world quality impact, as only one out of 35 layers is quantized statically, so 97.1% of the model is quantized using the imatrix. Here is the full llama.cpp error:
============================================================
Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I think in this specific architecture things can get rerouted to different experts for each layer so bad training data would only affect the first layer. But honestly the snowflake architecture is extremely complicated and poorly documented so I do not yet fully understand it.
I'm generating the remaining quants.
Awesome!
a) find training data that exercises that expert
I will try bartowski's imatrix training data on some smaller quant on the RTX 3080 GPU to check if it will activate all the experts.
b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage. The main issue is that it will not just affect this expert but the entire first layer. Despite there only being one expert missing the imatrix training skips storing it in its entirety. The first and last layers are usually quite important, so there will be some negative impact on quality, but it will be far from garbage. A better option would be to force imatrix training to store the partial data of the first layer, but I have the feeling that if that were easy the llama.cpp developers would have long done so.
Just had a somewhat worrying experience: the huggingface-cli silently failed to download all files, but also did not report a failure - when I tried to redo https://huggingface.co./xxx777xxxASD/L3-SnowStorm-v1.15-4x8B-B it skipped over 3 model files that are nevertheless in the repo.
I wonder how much silent corruption that would cause.
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage.
That's not what I proposed - b) proposes to use the data, not skipping it in quantize.
Also, llama.cpp tends to crash during quantisation; it did not actually generate garbage quants that often, although that was one outcome.
With your proposal I would expect very bad results, because we force low-bpw quantisation without any data on a tensor that seems vital, while the b) proposal would hopefully only leave it partially trash. The problem I see is that just writing e.g. 0 might make llama.cpp crash, so we might even have to synthesize data. The latter problem could be tackled when it happens, though.
None of this seems trivial to me.
I really don't want to implement your proposal in any case, I think it would be better to just leave out those quants in that case. Which also destroys my chance of getting an IQ1_S :)
Despite there only being one expert missing the imatrix training skips storing it in its entirety.
You think all the experts are in that one tensor? (or those three, actually)
You think all the experts are in that one tensor? (or those three, actually)
The dimensions of blk.0.ffn_down_exps.weight are [4864, 7168, 128, 1], which indicates it contains data for all 128 experts. If you look at https://github.com/ggerganov/llama.cpp/blob/43ed389a3f102517e6f7d5620d8e451e88afbf27/gguf-py/gguf/gguf_writer.py#L138 you see that all tensors with "_exps." in the name are supposed to contain data for all experts.
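If you want to double-check such shapes yourself, the gguf-py package that ships with llama.cpp can read them directly. A small sketch - the model path is the one from earlier in the thread, and the exact dimension ordering in the printout may differ from llama.cpp's own log:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/tmp/snowflake-arctic-instruct.Q8_0.gguf")
for t in reader.tensors:
    if "_exps." in t.name:            # expert tensors carry all experts in one tensor
        print(t.name, list(t.shape))  # e.g. blk.0.ffn_down_exps.weight with a 128 dimension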
That is exactly what I mean - that means your suggestion that one expert is missing does not match the data. We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
So the naive explanation, that our training data fails to activate an expert, must be wrong.
BTW, I don't understand any details of what the imatrix actually measures, specifically, what it means to lack data for part of a tensor.
I would not be terribly surprised if this was just a model defect (and maybe not even a defect). My reasoning is that we have models that generate NaNs, and according to the llama devs, this means the model is completely unusable, yet still they work fine, so there must be a way for parts of a model to be "unused". Of course, that reasoning is weak because f16 and below can't even represent NaNs, afaicr.
And for something completely different, in the downloader/interactive model summary page, I have changed the quality score calculation to be strictly monotonic - before, Q8_0 would unstably sort after Q6_K because they'd end up with the same integer score of 99. Now Q8, i-Q6 and Q6 get 99, 98, 97, respectively. I think that's a reasonable trade-off between being a simple ranking and assigning meaning to absolute quality differences. It also allows sharing imatrix and static quants in one table.
I don't think I can improve it much in the short-term (especially since I didn't do client work in the last days at all), but once I found a nice way to make the link, I will put the link to the model page and update all models. On the other hand, when I am more active working on my day job, I also tend to be more active on my hobby side. Strange how these things work - if there is little for-money work, I lack the impetus of doing side projects, too.
(Example: https://hf.tst.eu/model#testrepo-i1-GGUF)
We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
blk.0.ffn_down_exps.weight contains data for all 128 experts, but we only imatrix-measure 99.22% of it, so we miss exactly one expert for that specific tensor. We do get data for all experts on all tensors not associated with layer 0. We miss one expert in one layer, which causes llama.cpp to not save any imatrix data for this specific layer. We do have data for all experts for every other layer.
In any case I will soon try different imatrix training data to see if I can somehow manage to cover this specific expert in layer 0.
I would not be terribly surprised if this was just a model defect
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
I agree fully with your explanation (which matches my much more fuzzy understanding), but clearly this expert must somehow be activated if the other tensors for this expert somehow are. Clearly my understanding is flawed/missing, because I am surprised you can activate only part of an expert. I would assume all weights to be used. But I don't know how the imatrix measurement decides what was active and what not - my understanding is that using a tensor, or an expert "slice" of it, is essentially just a matrix multiplication, which should "use" all of it.
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.
seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.
How sad. Turns out he deleted them all for nothing. Today we finally got an official document explaining the new HuggingFace storage quota: https://huggingface.co./docs/hub/storage-limits and discussed in https://huggingface.co./posts/julien-c/388331843225875
*We aim to continue providing the AI community with free storage space for public repositories, please don’t abuse and upload dozens of TBs of generated anime 😁. If possible, we still ask that you consider upgrading to PRO and/or Enterprise Hub whenever possible.
Maybe we should consider upgrading the mradermacher account to PRO, as it is just $9/month, which is nothing compared to our operation cost, but it is not required for us or anyone else to do so.
I think if hf restricts the mradermacher account after saying "unlimited repositories" they are shooting themselves in the foot. They already did, though. Not sure what to do, but I am not sure it should be supported. Anyway, let's see what happens. I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too. I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less. And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
My account looks the same btw., i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Still feeling a bit woozy after so much relieving news today. On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too.
This is exactly what it means. Even for free accounts, storage for public repositories is unlimited as long as it is not being abused. They are mostly just begging for PRO. As with every tech company, for every PRO subscriber they have they can get a much larger sum of money from investors. This is also why the price of a PRO subscription is way lower than it should be given what you get.
I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
They for sure are aware of us and appreciate our work.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less.
Awesome to hear!
And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
Great to hear that you didn't lose any important files.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
I would likely do the same if I had a file system as massive as yours.
My account looks the same, i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
That's exactly what they mean.
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Amazon S3 frequent-access storage for 500 TB+ is $21/month/TB, so they already pay around $100k/month in storage cost for us (which at that rate would correspond to roughly 5 PB), but that's still almost nothing compared to what the bandwidth cost must be. Let's appreciate what they give us and not ask for more. If there are no models on HuggingFace there is no point in it even existing, so our and other users' time and resource investment is HuggingFace's biggest value and what keeps HuggingFace alive as a platform. We are essentially donating the resources of 11 servers and a massive amount of time to HuggingFace and the open source AI community, so I'm sure they see and appreciate what we do.
Here is a screenshot of their community post, which clarifies things:
Still feeling a bit woozy after so much relieving news today.
Today was awesome. I'm especially relieved about HuggingFace removing the storage quota for public repositories, as the storage limit worried me way more than it should have.
On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
Similar things have happened so many times to me as well. It always seems to happen when I explicitly tell them to let me sleep.
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
I wonder if removing the GPU from quantisation tasks would have any performance impact. I think those 400 MB don't really matter as we never really use the full GPU memory for imatrix anyways. But if it serves no purpose for quantisation we can just use llama.cpp without CUDA or set CUDA_VISIBLE_DEVICES to nothing.
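A minimal sketch of the second option, assuming the quant jobs are launched from a script we control (binary name, file names and quant type are placeholders):

import os, subprocess

# An empty CUDA_VISIBLE_DEVICES makes the CUDA build see no devices at all,
# so the quantize job cannot allocate its ~400 MB on either GPU.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
subprocess.run(
    ["./llama-quantize", "model.gguf", "model.Q4_K_M.gguf", "Q4_K_M"],
    env=env, check=True,
)

Whether hiding the GPUs has any measurable effect on quantisation speed is exactly the open question above.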
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
Datasets I tried so far:
- c4_en_ja_imatrix
- calibration_datav3
- imatrix-with-rp-format-data
- 4chan pol_062016-112019_labeled
- Tech-Awesome-Hub/mix-data
- GitHub Readme
- MMLU
- Merges between above datasets
The only ones that reached 127 out of 128 experts, other than yours, were "calibration_datav3" from bartowski and "imatrix-with-rp-format-data". Many datasets got far fewer experts than that. It clearly is the quality of training data and not the amount that matters. 4chan pol_062016-112019_labeled is massive, but when I aborted it, it only had 122 out of 128 experts on layer 0. MMLU, which I thought is really diverse, only managed to trigger 121 out of 128 experts on layer 0. "Tech-Awesome-Hub/mix-data", with just 120 out of 128 experts on layer 0, was even worse than that.
In conclusion you have really awesome imatrix training data and many of the datasets I tried were significantly worse. So "imatrix-training-full-3" is likely better than you think. I will continue trying to find datasets that activate all experts. If you have any idea what datasets to try please let me know. I'm really interested in this topic.
A somewhat urgent request for your input, deepseek imatrix just failed:
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift to disable in all llama-imatrix calls?
I wonder if removing the GPU from quantisation tasks would have any performance impact.
I am, as usual, unburdened by actual knowledge, but I always thought it's cpu-only. And I suspect the 384MB is some kind of, well, not leak, but probably some dummy workspace allocation. In any case the gpu is completely idle when quantizing.
or set CUDA_VISIBLE_DEVICES to nothing.
I'll do that and see what happens.
In conclusion you have really awesome imatrix training data
No wonder, as the first part is bartowski's training data :)
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift to disable in all llama-imatrix calls?
Support for the --no-context-shift option was added to imatrix computation yesterday by bartowski in https://github.com/ggerganov/llama.cpp/pull/10766, so make sure to use the latest llama.cpp or it will not have any effect.
According to https://github.com/ggerganov/llama.cpp/issues/9390 if disabled:
- Requests bigger than context window will result in an error.
- n_predict for each sequence will be capped to n_ctx - n_tokens_prompt
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
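For illustration, the flag would simply be appended to the usual call, something like llama-imatrix -m model.gguf -f calibration.txt -o model.imatrix --no-context-shift (file names here are placeholders) - and, as noted above, only a llama.cpp build containing that PR will actually honour it for imatrix computation.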
Online repacking got merged which removes llama.cpp support for all Q4_0_N_M quants: https://github.com/ggerganov/llama.cpp/pull/10446
I highly recommend no longer generating them, as they no longer run in the latest llama.cpp. Even bartowski will no longer upload the now deprecated and unsupported ARM/RISC-V quants: https://huggingface.co./posts/bartowski/807894839859408
I'm quite happy about this llama.cpp change as ARM/RISC-V quants were kind of stupid: they used the same data, just aligned differently to be optimized for a specific architecture.
I'm quite happy about this llama.cpp change as ARM/RISC-V
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
Anyway, thanks a lot for your updates/feedback. I'll try it out on deepseek asap, and then probably hardcode it.
[snowflake] If you have any idea what datasets to try please let me know.
I don't, but maybe something was published on the training material, or its area of expertise. For example, if it lists support for 22 languages, maybe we need some of these languages.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of MoEs which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
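A sketch of what synthesizing such neutral values could look like, purely as an assumption about the data layout (per-tensor sums of squared activations plus call counts, stored as one equal slice per expert): never-seen experts inherit the average statistics of the seen ones, so quantize would treat them as averagely important instead of refusing to run.

import numpy as np

def fill_missing_experts(values, counts, n_experts=128):
    # values/counts: flat per-column statistics for one *_exps tensor,
    # assumed to be laid out as n_experts equal slices
    vals = np.asarray(values, dtype=np.float64).reshape(n_experts, -1)
    cnts = np.asarray(counts, dtype=np.float64).reshape(n_experts, -1)
    seen = cnts.sum(axis=1) > 0
    if not seen.all():
        vals[~seen] = vals[seen].mean(axis=0)   # neutral: average of observed experts
        cnts[~seen] = cnts[seen].mean(axis=0)
    return vals.reshape(-1), cnts.reshape(-1)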
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
I've overriden deepseek for the time being.
PS: I haven't watched top, so I don't know if memory usage for deepseek (or the new llama-imatrix) is considerably larger than for other models.
PPS: you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
I started nico1 again. You can do DeepSeek-V2.5-1210 now as nothing beside nico1 is currently running. I recommend you interrupt any other quantisation and imatrix task before starting it as RAM will be relatively tight.
Sorry, was sleeping. I'll have a look. I'll investigate why rich1 can no longer reach nico1.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
Hmm https://huggingface.co./WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum amount of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)
and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). hopefully this hack will fix the stuck uploads.
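The actual wrapper is a C LD_PRELOAD shim around read(); the same idea expressed in Python, for illustration only:

import signal

class ReadStalled(Exception):
    pass

def _alarm(signum, frame):
    raise ReadStalled("read did not complete within 90s")

signal.signal(signal.SIGALRM, _alarm)

def read_with_deadline(f, size):
    # alarm(90); read(); alarm(0) - a stuck read gets interrupted by SIGALRM
    # instead of hanging the upload forever
    signal.alarm(90)
    try:
        return f.read(size)
    finally:
        signal.alarm(0)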
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.
I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.
Phew.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
- Only the files in /dev/shm (model.status, model.log) keep the model in error state. Once removed, the scheduler will try again the next time it runs.
- You can edit things, fix things, and then delete the error status files, followed by pushing (echo push nico1 >/dev/tcp/10.28.1.1/16713) - a small sketch of this flow follows after this list.
- You could move the original download away and replace it by the model subdirectory, in which case the scheduler would try to pick it up from there.
- I will eventually provide you with better tools, though... Long term, it could make sense to move everything to nico1 (and either have containers everywhere, or simply give you a user account - I planned for these eventualities many months ago by making the default umask 0 :)
- If things go wrong, you can do the next step manually, e.g. you could somehow provide the .gguf file, and when the scheduler runs and error state is cleared, it would simply pick off from there.
- there is very little state that is not externalised, e.g. the scheduler distinguishes a partial download from a successful download by the existence of the model.hfd-success file. There are also .override, .interrupt, .nobudget and .force files. You can stop a model by creating a model.override, make it ignore the budget, or simply force-start it.
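A minimal sketch of the "clear the error state and push" flow from the first two points, assuming the status files live under /dev/shm as <model>.status / <model>.log (my reading of the list above) and using a made-up model name:

import os, socket

MODEL = "Example-Model-7B"  # placeholder

# 1. drop the error-state files so the scheduler is willing to retry
for suffix in (".status", ".log"):
    path = f"/dev/shm/{MODEL}{suffix}"
    if os.path.exists(path):
        os.remove(path)

# 2. poke the scheduler, equivalent to: echo push nico1 >/dev/tcp/10.28.1.1/16713
with socket.create_connection(("10.28.1.1", 16713)) as s:
    s.sendall(b"push nico1\n")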
I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.
You seem to be quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible. And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
Adding them as separate quants was a mistake. In hindsight, online conversion should have been the way to implement this from the beginning. What started with a few ARM quants got out of hand quickly, and soon we would likely have had dozens of Q4_N_M quants optimized for different architectures, so switching to online conversion was the only reasonable way forward. Now that there is online conversion, supporting existing Q4_N_M quants is useless, as llama.cpp can now just write data to memory in an optimized way while loading the model.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
It's probably because imatrix computation reuses the same code as other llama.cpp components and so offers similar configurations, even if some of them don't really make sense for imatrix computation.
I don't, but maybe something was published on the training material, or its area of expertise.
According to their advertisement everything should be public, but I'm having trouble locating anything useful. They spread everything across random blog articles and papers, and this massive fragmentation makes finding anything too time-consuming.
For example, if it lists support for 22 languages, maybe we need some of these languages.
I already tried multiple multilingual imatrix datasets without any success.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of MoEs which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
I already tried around 10 MB worth of datasets, so yes, it might indeed be unlikely that any reasonable prompt will activate that expert. It is likely something super niche like an enterprise programming language such as COBOL or Erlang, as it is an enterprise-focused model.
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
Maybe that really is the way to go and it would also solve this issue with other MoE models. What I already tried is forcing the router to use all experts using --override-kv llama.expert_used_count=int:128
but it unfortunately had no effect for imatrix computation.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit. There was a VM running using 24 GB of memory at the time, and maybe some other things.
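Something along these lines would do as a pre-flight check (a sketch; the 20 GiB headroom for context and other processes is an arbitrary assumption):

def mem_available_gib(meminfo="/host/proc/meminfo"):
    # MemAvailable already accounts for reclaimable page cache, which is
    # what matters for a large mmap'ed gguf
    with open(meminfo) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 2**20  # value is reported in KiB
    return None

def model_fits(model_gib, headroom_gib=20):
    avail = mem_available_gib()
    return avail is not None and model_gib + headroom_gib <= avail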
you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
Yes, I will likely look into it, but it is quite a pain with ZFS. I really hate swap, but the current behavior of just rebooting on OOM also isn't ideal. I wonder what happened to the OOM reaper that always prevented OOM crashes in the past.
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
mlock would explain the crash.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
No idea why this happened as well.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
100 GB should be enough to test models privately. He likely got way more than 100 GB, as you got CurrentPrivateStorageUsed + 100 GB when they introduced this limit. Besides going Pro, he could also email them to request more private storage for testing, which they will most likely accept as a valid reason under their new private storage grant program. I wonder why he is not testing them before uploading. It seems quite wasteful to upload models you have not even tested. The machine you use to train a model should also be able to run it as far as I'm aware, except for merges.
I like the new policy, as closed models are almost always meant for commercial use and so are used by operations that really should pay for HuggingFace. They have to make money somehow, and enterprise customers make the most sense in my opinion.
By the way, when I researched HuggingFace's finances it seemed like the vast majority of their earnings comes from consulting services. I luckily work for a company where we don't waste money hiring consultants.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum amount of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
Awesome to hear!
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)
Your observation is likely correct. There was a lot more activity back then. For example take a look at https://huggingface.co./cognitivecomputations which created a Dolphin version of every good AI base model. Most of them are from early 2024.
and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). hopefully this hack will fix the stuck uploads.
Nice. This seems like a good workaround. Let's hope this fixes this issue.
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.
Today we reached a queue size of over 4000 so I'm really happy it will now finally go down from here. Especially now that we lose 4 hosts in one day.
I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.
Post-august you already had nico1 so there should be way less and as observed there generally are way less models recently. Before February would likely be insane but we can be way more selective.
Hmm https://huggingface.co./WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
Sounds like a strange HuggingFace bug. Maybe they never anticipated someone ungating so many models. For easy models you can always ask me or Richard to ungate and for hard ones we always have Guilherme34
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
I'm debating whether to make some kind of web interface, which would also allow, other people to do things, but... I'm a comand line person.
No worries. Using the command line is perfectly fine for me as I'm mainly a command line person as well. In a fraction of the time required to create a webpage we could likely create a nice command line application/shell script to automate all common manual tasks.
You seem to be quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible.
I would love to help with this. Mainly with queuing models requested by users, so they get their requests fulfilled faster if you are unavailable and you don't have to care about this when you are busy. In that case it should also not matter if I'm ever too busy to help, as any time I can spend on this will be an improvement over the current situation.
Should the queue ever get empty I will queue some historical models I feel are important and then maybe do some model authors I like, but I would likely run out of ideas at some point. I don't think I would have the dedication to go through and judge 30000 models to select the best ones. Your work on selecting models is highly appreciated.
And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
No worries, I like to help get interesting models to work. My time is always limited so I can't look into every single model that failed, so I focus on interesting models and the ones requested by users.
Now that there is online conversion, supporting existing Q4_N_M quants is useless
Well, not for those already downloaded... In any case, yes, I agree that, if it's a maintenance burden, it should indeed just go.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit.
That's... I guess you'll have to tell me what you would want me to look for and/or calculate. It is, however, notoriously difficult to check this beforehand, so likely this would just make the imatrix job fail (i.e. the imatrix job would check). That's not super-bad, as that is already happening for special models.
KaraKaraWitch is now publishing all repos
Well, it's another psychological effect. The alternative would be to gate the models, I guess, and keep them public, until they are tested.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
Yes, and don't be shy, even if I was a bit, ehe, cranky recently. You are interfering a great deal, and it's predominantly very useful :)
No worries. Using the command line is perfectly fine for me as I’m mainly a command line person as well
I was thinking less of you, and more of others. But, yeah, command line is the way to go at first.
You have been rate-limited; you can retry this action in about 24 hours. If you're a new user, your limits will raise progressively over time. Get in touch with us at [email protected] if you need access now.
Small models are too much for huggingface. I'll mail them.
Small models are too much for huggingface. I'll mail them.
Oh no, maybe it was not so intelligent after all to do all the large models first. We are using many nodes, and such limits are usually either token or IP based but rarely user based, so this should not be an issue. If it's an upload limit, try giving each host a separate upload token. If it's a download limit then maybe we are always using Guilherme34's token and so exceed that token's rate limit, in which case either download anonymously or use a dedicated token by default. If this issue only occurs on rich1 then maybe it is because we just started some face picture data collection project there. In the end there really could be a per-user limit, in which case we have to email them or work around the limitation.
Have you realized that mradermacher is now part of the https://huggingface.co./TopContributors-ModelDownloads organization? That is so cool!
-2000 14 Falcon3-Mamba-7B-Instruct run/imatrix (GPU-2d) 101/64 22.58s/c 42.5/126.0m(127.0) [112/335] 6.9410
Something tells me that a 7b should not take 2h for imatrix quantization.
Oh no, maybe it was not so intelligent after all to do all the large models first.
We did not do all the large models first, we only preferentially did them. rain/kaos/back etc, all did small models the whole time. So if we did it more "intelligently", we would just have hit it earlier when rich/marco/nico would hit small models randomly.
The limit is on repository creation btw., I can try to optimize it, but I will ask for an exception first. The issue started on rich1, but has now affected everything. I suspect it was simply the speed of quantizing small static models.
Have you realized that mradermacher is now part of the https://huggingface.co./TopContributors-ModelDownloads organization? That is so cool!
Hmm, I thought I had it mentioned already, but what happened is that whenever I clicked on "Quantizations" on a model page, I got a blocking invitation page asking me to either join or refuse to join that organisation. Normally I ignore those things until I have made up my mind (not sure I want to be part of any such organisation :) but since I was forced, as my access to the webpage was limited, I hit accept.
BTW, it also made all uploads fail, which I have to clean up manually. At least it didn't hammer their servers. And since I am rather limited w.r.t. email (my private mail server is the one currently being restored), I had to ask marco to contact the website team :)
Actually, what's the company mail server queuing... only 3130 mails it can't deliver to me. Sigh.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
Try using different upload tokens on each host. Even these limits are probably applied on a per-upload-token level, according to @RichardErkhov. It's at least worth a try.
@RichardErkhov already reached upload limits, API limits, inference limits, file list limits, repo creation limits, request limits, gated request limits, file size limits, file count limits and space size limits, so he should be an expert when it comes to limits. Almost all limits he encountered were on a per-token basis. Since then he uses 3 different upload tokens so he no longer hits a single limit.
I've changed the upload to retry after 10 minutes when it happens, so only newly started jobs will fail (which are far easier to clean up). I'll wait for a while to see if hf increases the limit - they did it before (when it was so low that simply me clicking around on the webserver triggered it regularly). Depending on the situation I will try different hf tokens per host. However, I am pretty sure a single host can already trigger it, so I might have more tokens per host.
However, I am pretty sure a single host can already trigger it, so I might have more tokens per host.
That's exactly what @RichardErkhov is doing. 1 token per host is usually enough for him, but if not he uses a separate token for every Python instance.
For now I would just do different tokens for each host as this should be simple to implement and a single host triggering the limit is relatively unlikely.
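Assuming the upload scripts go through huggingface_hub, something along these lines should be all that is needed - the environment variable names and the repo are made up, so this is just a sketch:

import os
import socket
from huggingface_hub import HfApi

# e.g. HF_TOKEN_NICO1, HF_TOKEN_RICH1, ... set per machine
host = socket.gethostname().split(".")[0].upper().replace("-", "_")
token = os.environ.get(f"HF_TOKEN_{host}") or os.environ["HF_TOKEN"]

api = HfApi(token=token)
api.upload_file(
    path_or_fileobj="model.Q4_K_M.gguf",
    path_in_repo="model.Q4_K_M.gguf",
    repo_id="mradermacher/some-model-GGUF",  # placeholder
)

That way every host hits its own rate-limit bucket.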
Nope, doesn't work, leia just triggered it, and leia already uses a separate hf token.
I will try to reduce the load on the api and check in another way for successful repo creation, later tonight when I have time. Thanks, hf.
requests.exceptions.HTTPError: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing huggingface-cli login, and if you did pass a user token, double-check it's correct.
I am now completely blocked, it seems. I can't do anything whatsoever.
Creating new tokens does not have any effect, but from time to time, an upload goes through. It seems I am severely rate limited, and I feel this is not due to the increased upload frequency. It feels like some new limit. Also, huggingface-cli upload uses the limited create_repo API, so me reducing calls to it will have very little effect.
I guess only hf can do something about it, and if they don't in a few days, we have to shut down.
Things are slowly starting to upload again. I guess we'll find out more this evening.
Nope, the rate limit is still there, and severe. I'll try some tricks and hope there won't be a disaster tonight.
Reducing the number of API calls to the bare minimum seems to be the only solution for now, so try every trick possible. As far as I'm aware every commit is an API call, so maybe we should batch together some files for small models. Also make sure downloads don't use any mradermacher token.
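To illustrate what I mean by batching - a sketch only, assuming huggingface_hub and placeholder repo/file names: all quants of a small model go into a single create_commit call instead of one commit (and therefore one call to the limited endpoint) per file.

from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()  # uses the cached/env token

files = [
    "model.Q2_K.gguf",
    "model.Q4_K_M.gguf",
    "model.Q8_0.gguf",
    "README.md",
]

api.create_commit(
    repo_id="mradermacher/some-small-model-GGUF",  # placeholder
    operations=[CommitOperationAdd(path_in_repo=f, path_or_fileobj=f) for f in files],
    commit_message="upload all quants in a single commit",
)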
The rate limit doesn't seem that severe. All uploads seem to eventually make it through. Commits to 15 models were successfully made in the past hour, and I see the same outgoing network traffic on nico1 as on any normal day:
The rate limit doesn't seem that severe.
I have already almost halved the number of api calls yesterday and implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it, but I think it is pretty severe, especially as we had similar rates earlier, when we did the first batch of small models (the nice 800 ones), so this definitely looks like something that has been changed since then.
Ok, the first batched uploads are through, and nothing seems to have imploded. I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
I have already almost halved the number of api calls yesterday
Oh wow, I wasn't aware of that. It's quite insane we are still hitting the limit despite those changes and despite decommissioning db1, db2, db3 and backup1.
implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it
I think and hope so as well.
so this definitely looks like something that has been changed since then.
Yes, this definitely seems like a new limit, or @RichardErkhov would have known about it. He already managed to exceed almost every rate limit HuggingFace has.
Ok, the first batched uploads are through, and nothing seems to have imploded.
Awesome to hear. Let's hope everything continues to go well.
I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
Definitely not an ideal situation, but better than hitting the rate limit. Everything looks good to me for the repositories I checked. I will be here and watch things, but I'm quite confident nothing bad will happen. Have a great night!
Maybe things are not going so great after all. "auto-patch README.md" is going a bit crazy and is removing references to existing static quants on some long-completed models:
- https://huggingface.co./mradermacher/Nemotron-4-340B-Instruct-hf-i1-GGUF/commit/f4bc99d59dcd92f65c681dfc50bd6a757435f300
- https://huggingface.co./mradermacher/Hermes-3-Llama-3.1-70B-lorablated-i1-GGUF/commit/baab2edf5a54d43d775b368ff065be2d063c1da4
- https://huggingface.co./mradermacher/SILMA-9B-Instruct-v1.0-i1-GGUF/commit/d8fbfa1fe718e78034b552dbe4318482a9ace9e7
It does the same to static quants, where it removes references to imatrix quants:
I assume this is caused by poor error handling inside the "auto-patch README.md" where it assumes an API rate limit status code means the model doesn't exist. Also, scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.
Interestingly, it now started to fix things it previously broke:
@RichardErkhov would have known about it. He already managed to exceed almost every rate limit HuggingFace has.
Haha, he often reminds me of younger me. The real test is when we hit other large blocks of static-only jobs again (today it mostly did much slower imatrix jobs).
I assume the number of create_repo calls has gone down by a factor of about 5.
I assume this is caused by poor error handling inside the "auto-patch README.md" where it assumes an API rate limit status code means the model doesn't exist.
Good idea, but unfortunately, it checks for that either by downloading the README.md without the API (original model) or by using the list of all mradermacher models (for finding other quant repos). I'll have to look at it. As long as the original url is still intact, it will be fixable.
Also, scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.
I'm not doing that on every change, fortunately; that's a background job that has essentially a fixed rate limit (more models == fewer iterations per time). The API affected seems to be only repo creation (which is called once or twice per job, and was called twice per upload).
I'll have a look into the problem, thanks for catching those, which is a job well done :)
Interestingly, it now started to fix things it previously broke:
Fascinating, so, obviously intermittent errors of some kind. It runs on each repo after each job finishes, and tries to scan through all repos separately every 2 days at the moment. What you see is likely the background job that fixes the links to original urls and so on.
Hmm, not good, all those wrongly updated model pages are not in the llmjob log, so it must have been done by the background job. Unfortunately, that one really uses the list_models api call to get a list of all repos once, and then just checks if the static/imatrix repo exists, while the foreground job doesn't use the api but does a GET on the actual model (html) page to see if the model exists.
Unfortunately, I indeed key the latter on status 200, because you get all kinds of status codes when the model doesn't exist (404, 401...), so it's quite hard to know when it temporarily failed. I guess we'll have to live with this at the moment, unless I want to add more api calls for this.
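If I do revisit it, the idea would be roughly this - just a sketch, keeping the plain GET on the model page but treating rate limiting and server errors as "unknown" instead of "missing", so a transient failure doesn't cause the README to be patched:

import requests

def model_page_state(repo_id: str) -> str:
    try:
        r = requests.get(f"https://huggingface.co./{repo_id}", timeout=30)
    except requests.RequestException:
        return "unknown"             # network trouble: leave the README alone
    if r.status_code == 200:
        return "exists"
    if r.status_code in (401, 403, 404):
        return "missing"             # gone or gated, same as before
    return "unknown"                 # 429, 5xx, anything else: skip patching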
I think repo creation has an especially low(*) api limit, and whoever did that was probably not aware of every upload calling this endpoint (in fact, I was not aware - maybe it is a recent addition, because I manually create the repo on upload because hf-upload would otherwise fail).
*: comparatively speaking :)
Unfortunately, that one really uses the list_models api call to get a list of all repos once
That is where I was wrong: it should have done it, but due to heavy refactoring it failed, so this explains it. The foreground job can still fail to correctly patch it, but the background job should get that part right. And if the original model page is missing occasionally, that shouldn't cause a diff. Famous last words.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us. Someone likely just thought that limiting repository creation to 100 per hour makes sense, as nobody could reasonably exceed that, not realizing that the same API call is made for every commit.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
They are notorious for being slow. @RichardErkhov successfully contacted them in the past regarding an API issue by creating an issue on their GitHub, but they have not yet fixed it after almost 3 months despite confirming the issue: https://github.com/huggingface/huggingface_hub/issues/2581
Especially now that most of them are likely already on Christmas holiday, I'm really not surprised information like this is not reaching the right people. Even in my much smaller company, bug reports often get lost somewhere in middle management. I recommend you create an issue on their huggingface_hub GitHub instead where you are much more likely to reach someone capable of fixing this issue.
But honestly, things don't seem that bad. Despite all these API rate limits, they do not seem to affect our throughput, so maybe we can just live with it. It seems unlikely that we will ever again have such a massive queue of only small models. Speaking of queue size, I'm currently very satisfied with the progress; we already got it down from over 4K to below 3.5K in just a few days.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us.
No, but I contacted them three days ago, and they didn't even bother to reply. I judge by actions.
They are notorious for being slow. @RichardErkhov successfully contacted
Your example shows a reaction time of less than a day, though, so clearly they can if they want to.
I recommend you create an issue on their huggingface_hub
I am not going to create potential drama somewhere - they asked me to use e-mail, and I used e-mail. If somebody wants to do that, that is fine, but, again, I went through the official channels for this, I don't want any special service.
But honestly, things don't seem that bad. Despite all these API rate limits, they do not seem to affect our throughput, so maybe we can just live with it.
I can of course live with this, but it obviously affects our throughput. An hour ago, no quanting was done, and right now, four nodes are still not doing anything much.
Nico, I feel you are panicking a bit because I sound so negative - don't worry, I am not trying to declare war on hf or give up, I am merely adjusting their way-too-good reputation in my head. And I have learned to judge companies by their actions, not by the goodwill of fans. Or should have learned :) This is an attitude correction for me, not a disaster.
Addendum: you can probably tell by now that I am a staunch anti-neoliberalist and work for a tiny, very personal company for a reason :) Don't worry, I am also a realist :)
@mradermacher The status page (http://hf.tst.eu/status.html) has been frozen since 2024-12-20 16:05:00+0100 and both nico1 and rich1 are idle. There no longer seem to be any models being uploaded, so I assume something critical broke, and I don't think there is anything I can do to fix it.
I checked the kernel log on StormPeak, and the time it broke seems to roughly align with the time my RTX 3080 GPU crashed, but that GPU is not used by nico1, as only the RTX 4090 GPUs are assigned to your LXC container, so it should not be related:
Dec 20 15:55:19 StormPeak kernel: NVRM: GPU at PCI:0000:c1:00: GPU-c8fe94f9-541b-e16b-da0f-b8d38ea5283e
Dec 20 15:55:19 StormPeak kernel: NVRM: Xid (PCI:0000:c1:00): 62, pid='<unknown>', name=<unknown>, 2027f626 2027f426 2027fcf4 20288f2a 20288e30 2021b5b8>
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x55:2477)
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nv_open_q:2903 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nv_open_q state:D stack:0 pid:2903 tgid:2903 ppid:2 flags:0x00004000
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nvidia-smi:2356875 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nvidia-smi state:D stack:0 pid:2356875 tgid:2356875 ppid:2341557 flags:0x00004006
(...)
Dec 20 16:00:50 StormPeak kernel: INFO: task nv_queue:2901 blocked for more than 245 seconds.
Dec 20 16:00:50 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 16:00:50 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 16:00:50 StormPeak kernel: task:nv_queue state:D stack:0 pid:2901 tgid:2901 ppid:2 flags:0x0000400
After more carefully reviewing the kernel log, it indeed seems that nico1 somehow got affected by the issue with the RTX 3080 GPU:
Dec 20 15:58:48 StormPeak kernel: INFO: task llama-quantize:2364235 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:llama-quantize state:D stack:0 pid:2364235 tgid:2364235 ppid:1469293 flags:0x0000000
llama-quantize should not use any GPU, and the faulty GPU is not even attached to your LXC container, so it is really strange this happened. There are tasks running, so I'm not sure if the system is in a state where it can tolerate a reboot of nico1, but it currently is not working at all, so it likely can't get any worse. It would be really interesting to know how a stuck quantize task on nico1 brought the entire system to a halt.
I disconnected nico1 from the internet but still kept it running. Let's see if that is enough for the system to fix itself. All other hosts should now detect nico1 as offline and hopefully manage to recover.
It didn't help. I will reboot StormPeak now, but it's unlikely that will fix anything, as even without nico1 the system didn't recover.
I rebooted StormPeak, which fixed the RTX 3080 issue, and started nico1 again, but as expected this unfortunately didn't fix whatever issue brought the entire system to a halt.
Good morning. I don't know what happened. A hung llama-quantize should only hang that one job, but maybe something else also went wrong. The connection timeout (once established) is currently 3600 seconds, but that either didn't trigger or it somehow persisted across multiple runs of the scheduler. rich1 is also gone at the moment, which might play a role as well.
I also disabled the local scheduler a week or so ago because there is some weird bug where static jobs finish successfully within 10 seconds without doing anything, meaning static quants are not generated at all, so that didn't help either.
Obviously, there is a bug somewhere.
Since I am still not in such great shape, I opted to kill all processes holding locks, and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell Richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)
In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant; I practically couldn't do anything.
And if we don't talk to each other much, merry christmas and a happy new year :)
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could then do is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we regain the ability to upload quants separately, and without being rate-limited by those calls.
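The idea behind the replacement is roughly this - a sketch assuming huggingface_hub and a placeholder repo name: the repo gets created exactly once when the job starts, and every later upload goes straight to create_commit, which (unlike huggingface-cli upload) doesn't hit the create_repo endpoint again.

from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
repo_id = "mradermacher/some-model-i1-GGUF"  # placeholder

# once per repo, at job start
api.create_repo(repo_id, repo_type="model", exist_ok=True)

# per quant, as each file finishes
api.create_commit(
    repo_id=repo_id,
    operations=[CommitOperationAdd(
        path_in_repo="model.i1-Q4_K_M.gguf",
        path_or_fileobj="model.i1-Q4_K_M.gguf",
    )],
    commit_message="upload model.i1-Q4_K_M.gguf",
)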
(D'oh, forgot the README patcher)