RichardErkhov/FATLLAMA-1.7T-Instruct

#359
by RichardErkhov - opened
This comment has been hidden
RichardErkhov changed discussion status to closed
RichardErkhov changed discussion title from RichardErkhov/FATLLAMA-1.7T-Instruct to -------

Why did you remove your request for https://huggingface.co./RichardErkhov/FATLLAMA-1.7T-Instruct? Have you tried it and were not satisfied with the model’s quality? Have you already started quantizing it yourself?

Why would anyone create FatLlama-1.7T? I mean, seriously, what’s the point? You wake up one day and think, “You know what we need? A model so massive that even the clouds get nervous.” It’s like deciding to build a rocket just to go to the grocery store.

If this beats BigLlama 1T then this would be the world’s best openly available LLM, which would be a huge deal. The resources required to run such massive AI models are worth it if they generate output of better quality. There are many tasks where quality is way more important than quantity, especially in the corporate world, where the cost of running such a model is almost negligible.

Sure, it's impressive, but who’s running it? Probably not you, unless your PC is secretly a nuclear reactor

My PCs at home can likely run this at Q4_XXS as I can run 405B at F16. You really don't need that crazy hardware to run a model like this. A few old decommissioned servers you can get for cheap connected together over RPC will do. You just need around 1 TB RAM in total. All modern servers support 1 TB RAM so all you need is an ordinary server with 1 TB RAM. You could get DDR5 server RAM for $2500/TB if you get a decent deal which is not much more expensive than an RTX 4090.
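As a rough sanity check of the "around 1 TB RAM" figure, here is a minimal sketch that estimates the weight size from the parameter count and an average number of bits per weight. The bits-per-weight values are illustrative assumptions, not measured figures for any particular quant, and the estimate ignores context and runtime overhead.

```python
def quant_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GiB of the weights at a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 1.7e12  # parameter count implied by the model name

# Illustrative bits-per-weight values (assumptions, not measured figures).
for label, bpw in [("16 bpw", 16.0), ("4.25 bpw", 4.25), ("2 bpw", 2.0)]:
    print(f"{label:>9}: {quant_size_gib(N_PARAMS, bpw):8.0f} GiB")
```

Under these assumptions a quant at roughly 4.25 bits per weight comes to about 841 GiB, which is in line with the claim that a machine with around 1 TB of RAM could hold it.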

And what’s it going to do? Maybe predict your emails before you even think of writing them, or just become really good at finding cat videos. The real question is: Are we creating these gigantic models because we can... or because we’ve got something to prove to the universe? At this point, it’s less AI and more “hold my beer, I’m gonna run this thing.”

If I can ask this model a question and on average get a better answer than with BigLlama 1T, it will be worth it. My time wasted reading low-quality answers and the resources required to regenerate bad answers are way more expensive than locally running such a massive model.

So there it is, FatLlama-1.7T, taking up all your hard drive space like it’s a vacation rental that overstays its welcome. Forget about saving family photos or, you know, literally anything else. Hope you didn’t need that 3TB of free space—you’ve got a digital behemoth now.

It will not take up more space than BigLlama 1T currently does as I would just store it at a lower quant as I only have 896 GiB of RAM so storing any model at a quant larger than that would be pointless. If it turns out to be better than BigLlama 1T it would obviously also replace it.

Quants? Yeah, good luck with that. I tried to quantize it, and my computer just laughed at me and went back to running Minesweeper. It’s like trying to shove a mattress into a filing cabinet—not happening.

We definitely could quant it if the model is any good and worth quantizing. Doing so wouldn't even take that long. Computing the imatrix would be a bit of a pain, as that requires actually running the model, but we can do the imatrix computation at a lower quant like we did for BigLlama 1T.

But hey, maybe one day someone will figure out how to get this thing slimmed down to IQ-1 quant, where it’ll finally fit on something that’s not the size of a small country’s power grid. Imagine that: running FatLlama on your home rig, like it’s no big deal. It’ll probably be the same day pigs fly, or, in this case, llamas. But until then, we’ll keep dreaming... and buying more external hard drives, because apparently, we’re all data hoarders now.

We could do IQ1 quants of it within a day or so and, as mentioned before, even running Q4_XXS on my local rig should be possible right now. If you are willing to scrape together a few old decommissioned servers or spend a few grand, you could obtain hardware capable of running it as well.

In the meantime, FatLlama just sits there, taunting you with its untouchable size, like that box of cookies you said you wouldn’t eat. Maybe it’ll eventually do something useful, like solve world hunger, or more realistically, it’ll just become the best meme-generator the world has ever seen. Because let’s be honest, that’s the true endgame for AI anyway—perfect memes, instantly.

We will see about that once I try it out in a few days.

Welp, if by some miracle you actually manage to get FatLlama-1.7T up and running, don’t get too comfy—because you know what's next, right? FatLlama 3T. Why? Because who doesn’t want to flex with even more ridiculous numbers? It’s like saying, “Oh, you lifted 1.7 trillion? Cute. Try 3 trillion, champ.” By the time you’re done maxing out your power grid and turning your house into a data center, I’ll be onto FatLlama 5.8T, which will probably require a small star as an energy source. Challenge accepted? Or should we just call NASA now?

Hell no. This is getting absolutely ridiculous. I'm almost certain that if you go beyond 1.7T, quality will no longer improve in any meaningful way. I’m already skeptical whether there is any significant improvement compared to BigLlama 1T. If you go beyond 1.7T you really start getting to a point where most normal people will struggle to run this on their home setup. But then again, you can always buy more servers and connect them over RPC to run it anyway, but it will be getting quite slow. I can definitely say that beyond 1.7T there will for sure not be any imatrix quants from us.

Lol. You can do it. I just received a lot of negative feedback when I talked about it, so I decided people are just not ready for 1.7T. Model passes dummy check, so you can do gguf

RichardErkhov changed discussion status to open
RichardErkhov changed discussion title from ------- to RichardErkhov/FATLLAMA-1.7T-Instruct

What can I say, I dont have enough storage, I have a bunch of ram lol, so I guess it's now your time to shine @nicoboss

The model size is actually going to be a challenge even for us as it is uncommon to have such massive models. While there is 18 TB of M.2 storage in StormPeak, most of it is in use. We use a 4 TB SSD (2x 2TB RAID 0) for spool, which I always reserve for mradermacher to quantize. This model will likely be around 3.45 TB, almost filling that. Maybe with compression it will take around 3 TB (or hopefully even less), only leaving around 1 TB empty. We have another 4 TB of HDD storage on upool, currently in use for Llama 405B eval, but it will be empty in 1 day. We could download FATLLAMA-1.7T-Instruct to spool, make hf_to_gguf store the SOURCE GGUF on upool, delete the base model from spool and store the quants on spool as usual. But the issue with that is that upool is slow. Like 100 MB/s slow, meaning every quant would probably take around 12 hours, assuming llama.cpp manages to max out sequential HDD read speed. Another option would be to temporarily move the content of either apool or bpool to upool so we can have another 4 TB SSD at our disposal. But before I can do so I need to wait for the Llama 405B eval to be completed. Or maybe I should finally move a real HDD into StormPeak that, unlike the external HDD used for upool, isn't total trash.
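The per-quant estimate can be checked with a little arithmetic. This is a sketch assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and that reading the SOURCE GGUF once from the HDD dominates the run time; write time and CPU time are ignored.

```python
def read_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to read size_tb terabytes sequentially at throughput_mb_s MB/s."""
    seconds = size_tb * 1e12 / (throughput_mb_s * 1e6)
    return seconds / 3600

# Compressed (~3 TB) and uncompressed (~3.45 TB) sizes mentioned above.
for size_tb in (3.0, 3.45):
    print(f"{size_tb} TB at 100 MB/s: {read_hours(size_tb, 100):.1f} h")
```

Under these assumptions a single pass takes roughly 8 to 10 hours, so "around 12 hours" leaves some headroom for the drive not sustaining its peak sequential speed.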

Hi @mradermacher . I started to move things from bpool to upool so you can soon store the FATLLAMA-1.7T-Instruct SOURCE GGUF on bpool. I recommend you already start downloading the model to spool, but it will likely take at least 12 hours for everything from bpool to be moved to upool. I still have the Qwen 2.5 SOURCE GGUFs on upool and the large ones on cpool, so you can delete them now from spool, as otherwise storage will probably get too tight there. Regarding 405B eval, don't worry: I now mostly run this from cpool, and I will do the same for the Qwen 2.5 series. I decided to run 405B eval overnight so it should be done by tomorrow noon.

I now freed up enough space on bpool that the compressed FATLLAMA-1.7T-Instruct SOURCE GGUF should fit. I mounted it to your LXC container under /bpool. I recommend you download the model to spool, run hf_to_gguf with /bpool as output and then soft link the SOURCE GGUF from /bpool to /tmp after which you can run your static quant scripts as usual. This should make it much easier for us to deal with this massive model and process it in reasonable time.

You just need around 1 TB RAM in total.

You keep making my day, nico :-)

Anyway... yeah, not being in default locations is an issue for my scripts, so some manual work will be needed. I assume spool is actually my /? I could start downloading to /bpool right away, if the space runs out, I can resume (and 12 hours are over). Might be better than filling up / and then having to move. Also, is /bpool actually big enough?

mradermacher changed discussion status to closed

Yes, spool is your BTRFS pool mounted to /. Both spool and bpool should be big enough, but each only has enough storage to store this model once. So you definitely need to have source and destination on different storage pools when running hf_to_gguf. You probably don't want to end up having the SOURCE GGUF on spool, as the SOURCE GGUF and quants will not fit on the same storage pool either. So the best option seems to be downloading the model to spool, running hf_to_gguf with bpool as destination and then generating quants to spool.

mradermacher changed discussion status to open

better keep this open. am too trigger-happy.

We might have bigger problems, an undownloadable file:

403 Forbidden: None.
Cannot access content at: https://cdn-lfs-us-1.hf.co/repos/78/b7/78b7735965...

need help @mradermacher @nicoboss ? If you give me FTP, I will go to school tomorrow to upload everything to you. Yes, I am high school student lol

or if there are any other upload options, let me know

BTW., 12 hours per quant would not be an issue, really. I did this for large models in the past, and I am sure people will be patiently waiting for it :) I might look into running multiple quants in parallel - I did that in the past, and it should be safe, and essentially only be a single knob somewhere. Just pointing out we have lots of options :)

you guys have a lot of storage lol, I just have 2TB nvme and 1TB sata ssd for quants on my server

BTW., 12 hours per quant would not be an issue, really.

I created this model on hard drives. Old crappy hard drives. I think it was 40 hours of writing nonstop, one of the hard drives started shouting... RIP (rest in pieces)

am too trigger-happy.

We are similar I see

I hear you. Most of my hardware was not better until nico came along with his.

I hear you. Most of my hardware was not better until nico came along with his.

I have no nico sadly haha, I have me, my laptop, random server that I rented, and a bunch of google colabs for bnb quants, and random school computer for uploads because my home internet is not very nice

But you have your nico - in the form of this free quant service :) You are not alone :-)

We might have bigger problems, an undownloadable file:

403 Forbidden: None.
Cannot access content at: https://cdn-lfs-us-1.hf.co/repos/78/b7/78b7735965...

Oh no this is bad. Which one is it? He will probably need to reupload it to HuggingFace assuming HuggingFace corrupted it.

you give me FTP, I will go to school tomorrow to upload everything to you.

Fixing the files on HuggingFace seems like a better option as other users will rely on the source model to be working. But if this is not possible for some reason, we could also do it like this.

I have no nico sadly haha, I have me, my laptop, random server that I rented, and a bunch of google colabs for bnb quants, and random school computer for uploads because my home internet is not very nice

If you ever need some resources for an AI project just let me know and if I like it I will give you an LXC container for it. You can just ping me on HuggingFace or contact me on Discord at username "nicobosshard". My upload really sucked as well in the past but I recently upgraded to 10 Gbit fiber.

model-00003-of-00352.safetensors I am trying to redownload, but for some reason, huggingface-cli so far has skipped the file.

Wow, a sync on nico1 just took over 8 minutes :-)

But you have your nico - in the form of this free quant service :) You are not alone :-)

I am the quant service haha, just for smaller than biggest model on huggingface haha

Oh no this is bad. Which one is it? He will probably need to reupload it to HuggingFace assuming HuggingFace corrupted it.

Let me know, I still have it on my hard drives, and I can always recreate the model, I rewrote a quarter of mergekit to survive the computer rethinking it's life choices and deciding to literally start a fire (dont worry, I have a fire extinguisher)

Fixing the files on HuggingFace seems like a better option as other users will rely on the source model to be working. But if this is not possible for some reason, we could also do it like this.

obviously, but just in case direct upload is still an option

My upload really sucked as well in the past but I recently upgraded to 10 Gbit fiber.

I had a fight with my server provider for rising the limits, before 10gbps was for 20 seconds, then it went down for 8 hours... Now it's 8gbps for 2 minutes maximum, then down for 5-10 minutes. Under 8gbps I am free to use nonstop, I am very happy lol. Rewriting mergekit to directly download/upload files so I dont need to use my hard drives, because why not ?

I will let you know if I release 3T model. Should I even do that ?

If you ever need some resources for an AI project just let me know

thank you for an offer, I will add you in discord in case of anything

model-00003-of-00352.safetensors I am trying to redownload, but for some reason, huggingface-cli so far has skipped the file.

let me reupload that guy for you, wait... I dont know how long to wait, depends if my internet is in the mood for upload. Just proceed with downloading everything else

So, it skips the file because it was locked from before: FATLLAMA-1.7T-Instruct/.huggingface/download/model-00003-of-00352.safetensors.lock I will try to redownload the file to see if it was just a one-time fluke. But likely we need a re-upload.

I started because it might take a while, my upload from home internet usually like to fail like 5 times before trying to do anything

nevermind, it just hashed it and automatically skipped upload. In case download fails again I will delete and upload again

Also, I am not sure it will be easy to re-upload it, because hf will insist the file is already there, even if its deleted from the repo.

Also, I am not sure it will be easy to re-upload it, because hf will insist the file is already there, even if its deleted from the repo.

very bad boy lol. I can do something about it, for example delete file, clone repo and upload. So download everything to know which files to reupload

Yup, it's repeatable (403 Forbidden). @nicoboss any idea how he could reupload it? As to get the file, yeah, he can upload it to our ftp server. We could then reupload the whole repo (effectively making a mirror). Unless he wants to reupload the whole thing again, or we find another way.

I know how to clone repo on huggingface, takes 5 minutes. I just need a list of broken files to not repeat it 10 times @mradermacher
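One way such a list of broken files could be produced on the downloading side is to compare the expected shard names against what actually arrived. This is only a sketch: it assumes the shards follow the `model-00003-of-00352.safetensors` naming seen in this thread and that a failed download leaves no file under its final name. The function names and the directory in the usage note are hypothetical.

```python
from pathlib import Path

def shard_names(total: int) -> list[str]:
    """Expected shard file names, e.g. model-00003-of-00352.safetensors."""
    return [f"model-{i:05d}-of-{total:05d}.safetensors" for i in range(1, total + 1)]

def missing_shards(local_dir: str, total: int) -> list[str]:
    """Shards that are expected but not present in local_dir."""
    present = {p.name for p in Path(local_dir).glob("*.safetensors")}
    return [name for name in shard_names(total) if name not in present]
```

Calling `missing_shards("FATLLAMA-1.7T-Instruct", 352)` after a download attempt would then return the names that still need to be re-uploaded.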

Also, @RichardErkhov

I am the quant service haha,

yes, I noticed you many times. And I always keep wondering if some duplicated effort can be avoided. The problem, I think, is that by the time I queue models, I can't see any other uploads anyway, and thus have stopped looking for them. I have no clue how coordination between quanters could be done better, but I feel uncomfortable about the amount of (apparent) unnecessary duplication.

Maybe we could find out the ideal imatrix training set that everybody can agree to use, but I doubt it :)

hmmm, central server that gives out the tasks? That can be a good idea I think, because my code is mostly automatic, until it goes wrong because I have not enough space and some random person uploads 8x7b with the name "-7b" and I run out of space. Plus Im not doing imatrix, my only good gpu is my laptop 4090, but this laptop gpu has a very limited connection, so Im just doing normal quants.
So I guess the best thing is central server that gives the tasks out, like me doing just normal quants, then you get the ready gguf from me and do imat and upload it. I can make something like that over the next week, how do you look at it? Shall we make a discord group to talk there? It's not that good to talk here @mradermacher

Also, I get 403 for 3, 5, 8, 9, 10 safetensor files. Something is seriously wrong with hf.

@RichardErkhov when you try to just use huggingface-cli to upload the files again, what happens (huggingface-cli should read all files, and then skip files that are already in the repo - but maybe it does something)?

Also, I get 403 for 3, 5, 8, 9, 10 safetensor files. Something is seriously wrong with hf.

I think he is dying lol. I will just upload to your ftp, because as I know hf needs some time to process the models. I noticed it when some big model (like 405b) is uploaded, you cant download it for a day or two. So I guess for 1.7T it will be like a week to process?

@RichardErkhov when you try to just use huggingface-cli to upload the files again, what happens (huggingface-cli should read all files, and then skip files that are already in the repo - but maybe it does something)?

to upload big amount of files I have to bring my hard drive to school, so can do tomorrow only. I can upload a file or two, but not a hundred

running huggingface-cli to upload rn. Maybe it will tell me anything, right now it just reads for 10 seconds, wait for 30 seconds, reads for 10 seconds etc

So I guess the best thing is central server that gives the tasks out

Well, that's what our set up here is, but there is value in diversity. If all you would do is make static quants and use a slightly different (and arguably better :) repo name, then you could just give up because static quants cost us mere minutes either....

What would make sense to me is to somehow find a way to provide different services. For example, bartowski does not do static quants, and uses a different imatrix set. He also (I think :) invests more effort into models that have problems, which often get skipped on our side.

Our speciality seems to be both big models, and niche models that other people wouldn't invest effort in. And doing mass conversions (you are pretty good at that department as well, probably even better :) Lewdiculous has great presentation (yours sucks :), but low throughput.

So, some duplication is unavoidable (or isn't even duplication), but I also don't think centralisation and one-size-fits-all is the solution, either.

That's why I said I don't really see a good way to reach all these goals in a way that's better.

How, btw., do you select the models you quantize? I wrote a script that searches for and filters new/updated repos, and then every day select a low percentage of models (mostly by name).

it just reads for 10 seconds, wait for 30 seconds

Why the heck would it behave like that... Wait, 10 seconds is about the time to read one safetensor file. So it probably hashes each file, then waits for something on the server side. Probably giving you a virtual upload rate of ~30MB/s or so.

That's strange, I have never seen it do that. Here, it always reads all the files first before it starts uploading or doing anything else.

Fascinating.

How, btw., do you select the models you quantize?

I just tell my server "work please" and he tries his best. Sometimes he like to work like past few days when I just randomly got 1000 repos out of nowhere, today he doesnt want to work at all. Basically it just api.list_models(model_name="-7b",task="text-generation", tags=["safetensors",], limit=2000, sort='downloads'), then filters them using some random filters and then it quants everything in multiple processes. Flags can be different, but it works usually.
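The selection step described above could be sketched roughly as follows, with an added size guard against the mislabeled-repo problem mentioned earlier (a much larger model uploaded under a "-7b" name). This is not the actual script. The `api` argument is expected to be a `huggingface_hub.HfApi` instance, the keyword arguments mirror the call quoted above and may depend on the installed `huggingface_hub` version, and the 20 GB threshold and both function names are hypothetical.

```python
def total_repo_bytes(api, repo_id: str) -> int:
    """Sum the sizes of all files in a repo (needs files_metadata=True)."""
    info = api.model_info(repo_id, files_metadata=True)
    return sum(f.size or 0 for f in (info.siblings or []))

def select_models(api, name_fragment: str = "-7b", limit: int = 2000,
                  max_bytes: int = 20 * 10**9) -> list[str]:
    """List candidate repos by name, then drop those too large for the disk budget."""
    models = api.list_models(
        model_name=name_fragment,
        task="text-generation",
        tags=["safetensors"],
        sort="downloads",
        limit=limit,
    )
    # Hypothetical guard: skip repos whose files exceed max_bytes in total.
    return [m.id for m in models if total_repo_bytes(api, m.id) <= max_bytes]
```

With the real library this would be called as `select_models(HfApi())`. The extra `model_info` request per repo costs time, which is the trade-off for not running out of disk halfway through a batch.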

static quants cost us mere minutes either....

for me it cooks my server, I guess something is wrong

but I also don't think centralisation and one-size-fits-all is the solution, either.

very True words

has great presentation (yours sucks :)

well noone really looks at me, so might as well just secretly quantize all the huggingface

well Im trying to quantize less popular models as well, that's why I run multiple instances of code going from both sort='downloads' and sort='downloads', direction=-1

If I had more power, I would probably just quantized all the huggingface lol. I mean I can try optimizing whatever you have there, basically do some team work

Why the heck would it behave like that... Wait, 10 seconds is about the time to read one safetensor file. So it probably hashes each file, then waits for something on the server side. Probably giving you a virtual upload rate of ~30MB/s or so.

my internet supports 3 MB/s, he cant give me 30MB/s if I dont have them lol. My server has fast upload, my home internet sucks

That's strange, I have never seen it do that. Here, it always reads all the files first before it starts uploading or doing anything else.

well I guess it suppose to do the same, but maybe it has different opinion on ready >3TB from HDD ?

that's what I see right now:

E:\mergekit\mergekit>huggingface-cli upload RichardErkhov/FATLLAMA-1.7T-Instruct ./merged .
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co./docs/huggingface_hub/hf_transfer for more details.

my internet supports 3 MB/s

that's why I dont do imatrix, I just cant download/upload that much

poor guy is so hot I put a fan on it lol

I just tell my server "work please" and he tries his best.

So basically everything with -7b in the name that doesn't fail. You probably should venture into 8b's :)

But, it seems your set up is quite similar to ours, except we haven't automated the model selection yet. But it gives credence to your goal of quanting... everything.

Haven't been audacious to attempt "everything" (mostly because "everything" includes a lot of obvious garbage). Quite bold, I must say.

Anyway, if you did imatrix quants (which should be overkill for most of these models), then we could just skip all models with -7b in the name, problem solved :)

for me it cooks my server, I guess something is wrong

Probably not, we just have lots of servers, and nico's server probably counts as 5-10 servers on its own.

my internet supports 3 MB/s, he cant give me 30MB/

Yes, but huggingface-cli skips already uploaded files, so if it takes 40 seconds for those files, it's about ~30MB/s. And indeed, it seems to have gone through and thinks everything is already up. Sucks.

Let's see if @nicoboss has an idea, that isn't re-uploading everything. Just downloading it is difficult, as apparently so many files have been corrupted.

poor guy is so hot I put a fan on it lol

usb even :)

mradermacher changed discussion status to closed

You probably should venture into 8b's :)

that's just an example, I quant everything, I just need to change a line of code. It's just the latest line that was there

Let's see if @nicoboss has an idea, that isn't re-uploading everything.

delete files that are broken, clone repo, upload files again. Honestly I dont mind reuploading everything, as long as it works

Probably not, we just have lots of servers, and nico's server probably counts as 5-10 servers on its own.

so my single server cooks good ?

usb even :)

12W (12V 1A) fan on my usb attached hard drive. Why 12W fan you ask? My hard drive already melted 1 adapter, I dont want the last one to die

delete files that are broken, clone repo, upload files again. Honestly I dont mind reuploading everything, as long as it works

Probably the best way to proceed for some missing files. The bad news is that only 34 of the first 52 safetensor files are downloadable, so... it's probably simplest to redownload. But the question is, what the problem is.

We found two other repos with a single non-downloadable file so far. Never a case where so many had trouble. Something in the way you uploaded those must have caused this (or huggingface had an especially terrible day). Maybe it is worth pestering huggingface support, but in my experience, they never answer anyway.

that's just an example, I quant everything, I just need to change a line of code. It's just the latest line that was there

Hmm, but I don't see a lot of stuff in there that I stumble over almost every day (and ignore). So there must be some selection other than just "has safetensors/text-generation". Or do just so many of them fail (I have a good chance of saying whether a model will convert based on its name :)

Also, a clone would also clone deleted files. I don't think that would work. But what would suck if you reuploaded it and then again 20% of the files are corrupted, without you doing anything wrong.

Also, a clone would also clone deleted files. I don't think that would work. But what would suck if you reuploaded it and then again 20% of the files are corrupted, without you doing anything wrong.

maybe reupload slowly? It might be because of a big load? Let me try again, but with like a 2-3 minute break between each files. Will update you tomorrow on it I guess

Hmm, but I don't see a lot of stuff in there that I stumble over almost every day (and ignore). So there must be some selection other than just "has safetensors/text-generation". Or do just so many of them fail

I dont know honestly, it does whatever it wants lol

maybe reupload slowly? It might be because of a big load?

Maybe. All we know is that somehow, we get 403 for files that hf thinks are successfully uploaded. But we upload with gigabit or higher speeds all the time, and that doesn't seem to cause the issue. It might not be anything under your control, either, as this is clearly some corruption problem on hf's side. And it does seem to get worse.

I dont know honestly, it does whatever it wants lol

Makes sense. Since I select models manually, I also want to clean up failures manually. Fire & forget makes total sense if you just try to quant everything :) Still, a lot must somehow fail, as you have about double the number of models than we have, but we certainly don't queue half of the models each day.

a lot must somehow fail

It's random actually. So upload/download ratio in ideal condition is around I think 2x. Usually my upload/download is around 1.6x, but on few days it goes as down as 0.3x, like today it was 4.58 TiB down and 1.83 TiB up

For your information @RichardErkhov started uploading the model directly to a separate subvolume on bpool today at 08:00 GMT+2. He already uploaded 1300 GB. In parallel he is trying to fix the HuggingFace repository. We can expect to have the entire model by tomorrow afternoon.

I just started conversion of FATLLAMA-1.7T-Instruct. Because Richard uploaded the source model to bpool it will arrive under /tmp/new/FATLLAMA-1.7T-Instruct.SOURCE.gguf. I plan on immediately moving it back to bpool to not fill your entire spool with it. To ensure you don't accidentally fill up spool with other models in the meantime I temporarily rate-limited your internet speed to 1 MB/s. The conversion should be completed in 2 hours.

in the meantime, i found the reason why uploads sometimes fail despite being an endless retry loop - before i upload, i try to create the repo first, and when that fails (as it sometimes does with 504 gateway timeout), that is not retried.

if the fix works, then that was the last known mystery issue in the system.
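The fix described above amounts to putting the repo-creation step inside the retry logic as well. A minimal sketch of that idea follows. It is not the actual upload script: `create_repo` and `upload_files` are hypothetical callables, and the loop here is bounded with exponential backoff rather than endless.

```python
import time

def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds, backing off exponentially between failures."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

def upload_with_retries(create_repo, upload_files, **retry_kwargs):
    """Retry repo creation too, so a transient 504 there does not abort the whole upload."""
    retry(create_repo, **retry_kwargs)
    return retry(upload_files, **retry_kwargs)
```

The point of the design is that every network step goes through the same `retry` helper, so no single request can silently sit outside the loop.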

FATLLAMA-1.7T-Instruct is now ready to be quantized. I already softlinked it from bpool to /tmp/FATLLAMA-1.7T-Instruct.SOURCE.gguf

What did we expect:

llama_model_quantize: failed to quantize: n > N_MAX: 525 > 512 for key llama.feed_forward_length

As usual, I think llama.cpp should support it out of the box first.

Oh no, we again exceed LLAMA_MAX_LAYERS defined at https://github.com/ggerganov/llama.cpp/blob/1f66b699c48cb5ab3265ed72c48e8549b1674291/src/llama.cpp#L89. Just bump it to 525 and it will work. I'm a bit hesitant to try upstreaming this change as it will make llama.cpp worse for every model that isn't FatLlama 1.7T. Does it really make sense for llama.cpp to work for every model without recompiling, even if this means higher memory consumption and worse performance for every model out there, just to save the handful of users wanting to run FatLlama 1.7T the trouble of recompiling llama.cpp from source? Wouldn't it at some point make sense to instead tell those users to recompile llama.cpp themselves?

hesitant to try upstreaming this change

If llama.cpp is inefficient because of this, it should be improved, if these models should be supported. Also, if bumping from 512 to 525 makes llama.cpp significantly worse, then it should probably not have been bumped earlier for llama-405b (a much larger bump). Or maybe llama-405b is more important than this model...

(And it might not be the only bump required).

Does it really make sense for llama.cpp to work for every model

Absolutely. Conversely, it makes very little sense to provide quants that don't run with any inference engine out there, probably all of which would require a slightly different patch. The very, very few people who run this model can also quantize on their own, with the exception of making an imatrix file, or can use bitsandbytes and transformers.

The whole point of mradermacher is to provide files that people can use with their engine of choice, all of which share the limits that llama.cpp has. Even the original creator kind of dreams of an IQ1_S, which might even work for this model, probably so he can run it without fuss.

If there is a horizon for this to be supported eventually in llama.cpp, patching makes sense. But if we already expect it to not be supported, it feels pointless.

Since we all have invested so much into this (not least of all richard uploading a copy), I am inclined to just patch it and do it anyway (like, tomorrow afternoon after I updated my install script), but it would be frustrating to invest so much, probably have to fend off people who can't understand why their engine crashes, and effectively have zero users for it.

I agree and will open an issue and pull request to fix this on llama.cpp.

As promised I created issue https://github.com/ggerganov/llama.cpp/issues/9909 (Bug: LLAMA_MAX_LAYERS must be increased to run FatLlama 1.7T) and pull request https://github.com/ggerganov/llama.cpp/pull/9910 llama : bump max layers from 512 to 1024 to hopefully get this fixed in llama.cpp.

well done. although... max_nodes, max_experts and a bunch of others suffer from similar hardcoding. but you have to start somewhere.

btw., I bumped it to a modest 576, seeing that it does not seem to be a power of two.

Thanks a lot for starting to quantize FatLlama! I'm already looking forward to try this model.

well done. although... max_nodes, max_experts and a bunch of others suffer from similar hardcoding. but you have to start somewhere.

There were even more hardcoded values in the past. All my pull requests to llama.cpp so far were about getting rid of hardcoded values. We already got rid of LLAMA_MAX_NODES and GGML_SCHED_MAX_SPLITS.

Let's quote https://github.com/ggerganov/llama.cpp/issues/8528#issuecomment-2232567387:

It's a consequence of keeping llama_hparams trivially copyable with a compile-time known size, while having layer-wise hyper-parameters. Increasing the limit to 512 would make llama_hparams take 6.16 KiB instead of 3.16 KiB, but that's pretty much the only thing it changes. The size.
Making the layer-wise hparams take less space when not needed is something which I'll likely fix eventually, so that the limit only applies to models which need layer-wise hparams.

So really the only reason why they want to keep LLAMA_MAX_LAYERS is so they have a compile-time known size for llama_hparams. They seem to argue about a few KB of RAM usage. Absolutely ridiculous they have not bumped it much higher. Unlike LLAMA_MAX_NODES and GGML_SCHED_MAX_SPLITS where increasing them had a significant memory/performance impact increasing LLAMA_MAX_LAYERS seems to not have any meaningful real-world consequences.
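To put the quoted numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes the previous limit was 256 layers (the quote does not say so) and that the struct grows linearly with the limit. The variable and function names are made up for illustration and are not part of llama.cpp.

```python
# Rough estimate of sizeof(llama_hparams) as a function of LLAMA_MAX_LAYERS.
# Assumptions (not from the quote): the old limit was 256 and growth is linear.
KIB = 1024

OLD_LIMIT, OLD_SIZE_KIB = 256, 3.16   # assumed previous limit, quoted size
NEW_LIMIT, NEW_SIZE_KIB = 512, 6.16   # quoted limit and quoted size

# Bytes of per-layer hyper-parameters implied by the two data points.
bytes_per_layer = (NEW_SIZE_KIB - OLD_SIZE_KIB) * KIB / (NEW_LIMIT - OLD_LIMIT)
# Fixed part of the struct that does not depend on the layer limit.
fixed_bytes = OLD_SIZE_KIB * KIB - bytes_per_layer * OLD_LIMIT


def hparams_size_kib(max_layers: int) -> float:
    """Estimated struct size in KiB for a given layer limit."""
    return (fixed_bytes + bytes_per_layer * max_layers) / KIB


for limit in (512, 576, 1024):
    print(f"LLAMA_MAX_LAYERS={limit}: ~{hparams_size_kib(limit):.2f} KiB")
```

Under these assumptions each layer costs about 12 bytes, so even a limit of 1024 would only bring the struct to roughly 12 KiB.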

btw., I bumped it to a modest 576, seeing that it does not seem to be a power of two.

I know, but because I looked into it and came to the conclusion that this limit is arbitrary and pointless, I tried pushing my luck and going to the next power of two as they always did in the past. It seems unlikely that we will have a model with more than 1000 layers anytime soon, while everything between 500 and 1000 still seems possible given how many strange models are out there.

Absolutely ridiculous they have not bumped it much higher.

I concur. It's an issue because of 6kb? If it were copied for every weight, during inferencing, they'd have a point :-)

It seems unlikely that we will have a model with more than 1000

Richard is your friend.

I wonder if my clown suv quants or the other model that needed higher limits would now run out of the box :)

@nicoboss so, a nice distributed imatrix of the Q2_K? :-)

The lack of something useful between Q2_K and Q4_K is annoying.

@mradermacher I just said that to him in discord lol. We think the same

Actually, I forgot about Q3_K, I was so fixated on the bad IQ3 quants.

@RichardErkhov Q3_K_S is probably doable, unless nico does... things ... again, then who knows what is possible. and imatrix data is so crude, it will probably be fine, too.

@nicoboss we^W"somebody" should make a comparison of imatrix quants made with an imatrix from a Q2_K vs. f16, and see how much of a difference it actually does. Who knows, maybe my original idea of first doing an imatrix from Q2_K, then doing another imatrix from the i1-Q2_K might even have merit.

I still have hopes of doing the imatrix using RPC on IQ4_XS, but it is hard to tell if it will fit until we at least know its size. Back when I did 405B at 16 bit there was maybe room for around 50 GB more. But I am unsure how the number of layers affects the memory required for the context, because having the model on 4 RPC servers means 4 times the memory usage for the context.

In any case I'm going to run inference on Q3_K_S now as I wanted to try out the model anyways. I discovered that I can hardlink the resulting model while it is quantizing so it doesn't get deleted which I did for Q3_K_S.
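The hard link trick works because a hard link is just a second directory entry for the same data, so the data survives until the last name is removed. A minimal sketch, using throwaway file names in a temporary directory (the names are hypothetical and not the real paths on nico1):

```python
import os
import tempfile


def keep_via_hardlink(src: str, dst: str) -> None:
    """Create a second directory entry (hard link) for the same file data."""
    os.link(src, dst)  # both names must live on the same filesystem


with tempfile.TemporaryDirectory() as tmp:
    quant = os.path.join(tmp, "model.Q3_K_S.gguf")       # hypothetical name
    kept = os.path.join(tmp, "model.Q3_K_S.keep.gguf")   # hypothetical name
    with open(quant, "wb") as f:
        f.write(b"fake gguf data")

    keep_via_hardlink(quant, kept)
    os.remove(quant)  # simulates the quantize job deleting its output

    with open(kept, "rb") as f:
        survived = f.read()
    print(survived)
```

No data is copied, so this costs no additional disk space, but it only works within one filesystem.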

@nicoboss we^W"somebody" should make a comparison of imatrix quants made with an imatrix from a Q2_K vs. f16, and see how much of a difference it actually does. Who knows, maybe my original idea of first doing an imatrix from Q2_K, then doing another imatrix from the i1-Q2_K might even have merit.

I definitely want to do so. I'm actually really looking forward to it. I just completed gathering data for 405B with the 8-bit imatrix. I'm currently busy with the Qwen 2.5 series for the next few weeks, but after that I would really like to try different imatrix datasets and precisions. But for sure not on FatLlama 1.7T. 405B imatrix quants took over a week to run and I have yet to do its static quants and 16-bit imatrix quants.

I just did the calculation and it seems like Q3_K_L could barely be possible if we run absolutely nothing else on any of the machines and reduce the RAM on the OpenWrt router to 260 MiB (Note to myself: Use service dockerd stop, service redis stop on OpenWrt after boot or it will run out of RAM). This means no quantization tasks during this. Should it turn out to not be possible, Q3_K_M will easily fit as it is almost the same size as Llama 405B 16-bit. IQ4_XS will for sure not fit. Once you are done quantizing static FatLlama I will try running inference on Q3_K_L. I already hard linked the file to keep it instead of Q3_K_S, which I have now deleted.

at the moment there is a relative lull in models, so soon would indeed be a good time. i will not submit more models to nico1 from now on.

when imatrix is done I will create a job for it and try not starting it. I will then tell you how to trigger it and where to provide the file.

So, when I unpause it, basically, the file needs to be in /tmp/FATLLAMA-1.7T-Instruct.Q3_K_L.gguf, and you can trigger the imatrix process like this, from nico1:

telnet kaos 16713
imatrixjob-push

Should work after rebooting nico1 as well. The quant process should also start automatically once the imatrix is done, but I guess that doesn't matter, as that will be far in the future.

That's just in case I am not here, and/or so you don't have to wait for me.

BTW, did we test already whether priming with another (smaller) model works?

OH right, same parameters as last time for rpc? (better repeat them)

It might also be conceivable to run llama-cli once now automatically before each imatrix attempt, but not this time :)

I triggered it but it seems to care more about the time-of-day restriction because of it having a too low priority.

8000 FATLLAMA-1.7T-Instruct                        blocked/timeofday

I set /tmp/ignoretime and triggered it again a few times but I think due to the state it is in it will not recheck if that file exists. I even tried rebooting the LXC container but also didn't help.

Wow now it somehow switched state without me doing anything and is now showing this:

8000 FATLLAMA-1.7T-Instruct                        run/imatrix (status: vram)

I now created /tmp/imatrix.force to make it skip this check as well. However I again have no idea how to retrigger it while in this state. I tried just triggering it normally again.

It is now running!

 8000 FATLLAMA-1.7T-Instruct                        run/imatrix

Strange it is now back at showing:

8000 FATLLAMA-1.7T-Instruct                        run/imatrix (status: vram)

Oh no, it is not using RPC and instead tries to offload all the layers to its own GPU. So this is now actually a llama.cpp error due to misconfiguration or the wrong llama.cpp version, and not a bad check.

llm_load_tensors: ggml ctx size =    4.43 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 869584.72 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model '/tmp/FATLLAMA-1.7T-Instruct.Q3_K_L.gguf'
main : failed to init
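For scale, the failed allocation in the log can be converted to more familiar units. This is only a sketch: the parameter count is taken from the nominal "1.7T" in the model name, which is an assumption, so the bits-per-weight figure is approximate.

```python
MIB = 1024**2
GIB = 1024**3

requested_mib = 869584.72   # size from the cudaMalloc error in the log above
assumed_params = 1.7e12     # nominal "1.7T" from the model name (assumption)

requested_bytes = requested_mib * MIB
requested_gib = requested_bytes / GIB
bits_per_weight = requested_bytes * 8 / assumed_params

print(f"requested on device 0: {requested_gib:.1f} GiB")
print(f"approx. bits per weight: {bits_per_weight:.2f}")
```

In other words, with all layers offloaded locally instead of via RPC, llama.cpp tried to place roughly 850 GiB on a single GPU.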

Hopefully running out of GPU memory didn't mess with any of the primed RPC servers.

Triggered it again to see what is going on:

cat /proc/167331/cmdline | sed -e "s/\x00/ /g"; echo
/root/cvs/llama.cpp/build/bin/llama-imatrix -ofreq 10 -t 1 -ngl 0 -mg 0 -m /tmp/FATLLAMA-1.7T-Instruct.Q3_K_L.gguf -o /tmp/FATLLAMA-1.7T-Instruct.Q3_K_L.imatrix~ -f imatrix-training-full-3 --rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204 -ngl 10000

Seems as if llama.cpp was compiled with GPU and without RPC.
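The same inspection can be scripted. The sketch below splits the NUL-separated contents of /proc/<pid>/cmdline (what the sed call above does) and collects the values of a repeated flag. The helper names are made up for illustration, and the sample arguments are a shortened stand-in for the real command line.

```python
from typing import List


def split_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    return [part.decode(errors="replace") for part in raw.split(b"\x00") if part]


def option_values(argv: List[str], flag: str) -> List[str]:
    """Return every value that directly follows an occurrence of `flag`."""
    return [argv[i + 1] for i, arg in enumerate(argv[:-1]) if arg == flag]


def read_cmdline(pid: int) -> List[str]:
    """Read and split the command line of a running process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return split_cmdline(f.read())


# Shortened stand-in for the command line shown above.
raw = b"\x00".join([
    b"llama-imatrix", b"-ngl", b"0",
    b"--rpc", b"192.168.2.201:7201,192.168.2.202:7202",
    b"-ngl", b"10000",
]) + b"\x00"

argv = split_cmdline(raw)
print(option_values(argv, "-ngl"))               # both -ngl values, in order
print(option_values(argv, "--rpc")[0].split(","))  # individual RPC endpoints
```

This makes it easy to spot that -ngl is passed twice and that --rpc is present on the command line, which points at the binary rather than the arguments.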

I think I found the issue:

root@nico1:~/cvs# ls -lhtr
total 4.0K
lrwxrwxrwx 1 root root   15 Oct 16 20:45 llama.cpp -> llama.cpp-cuda/
drwx------ 1 root root 1.1K Oct 16 23:39 llama.cpp-nocuda
drwx------ 1 root root 1.1K Oct 16 23:39 llama.cpp-cuda

This softlink seems to be pointing to the wrong llama.cpp version.
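A check like this could be automated before starting a job. The following sketch rebuilds the directory layout from the listing in a temporary directory and resolves the symlink; the function name is hypothetical.

```python
import os
import tempfile


def resolve_build(link_path: str) -> str:
    """Return the name of the directory a build symlink finally resolves to."""
    return os.path.basename(os.path.realpath(link_path))


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout from the listing above inside a scratch directory.
    for name in ("llama.cpp-cuda", "llama.cpp-nocuda"):
        os.mkdir(os.path.join(tmp, name))
    link = os.path.join(tmp, "llama.cpp")
    os.symlink("llama.cpp-cuda/", link)

    active = resolve_build(link)
    print(active)
```

On the real machine one would call resolve_build("/root/cvs/llama.cpp") and compare the result with the build that is expected for the job.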

It failed:

8000 FATLLAMA-1.7T-Instruct                        error/1 (status: failure)

At least this time the issue seems clear. Some firewall changes will almost certainly fix this.

Failed to connect to 192.168.2.201:7201
Failed to connect to 192.168.2.202:7202
Failed to connect to 192.168.2.203:7203
Failed to connect to 192.168.1.204:7204
/root/cvs/llama.cpp-nocuda/ggml/src/ggml-backend.cpp:2236: GGML_ASSERT(ggml_backend_supports_buft(backends[b], sched->bufts[b])) failed
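Before retrying, it can help to verify that each RPC endpoint accepts TCP connections at all. This is a minimal sketch with made-up helper names; it only tests reachability and does not speak the llama.cpp RPC protocol.

```python
import socket
from typing import Dict, List


def can_connect(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to "host:port" can be opened."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_endpoints(endpoints: List[str], timeout: float = 2.0) -> Dict[str, bool]:
    """Check several endpoints and report the result per endpoint."""
    return {ep: can_connect(ep, timeout) for ep in endpoints}
```

For the setup above one would pass the four addresses from the --rpc argument, for example check_endpoints(["192.168.2.201:7201", "192.168.2.202:7202", "192.168.2.203:7203", "192.168.1.204:7204"]).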

It finally started sending the model to the RPC servers. Getting this to run was such a bumpy ride. I struggled the entire evening getting Q3_K_L to load. There seemed absolutely no way to fit Q3_K_L and all attempts failed until I had the awesome idea to make one of the RPC servers on StormPeak GPU only while keeping the other at GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. Using this I managed to load Q3_K_L for inference. After very carefully shifting around layers I found a configuration where IQ4_XS loaded as well. The RPC servers are currently primed with IQ4_XS and /tmp/FATLLAMA-1.7T-Instruct.Q3_K_L.gguf is a hard link of /tmp/quant/FATLLAMA-1.7T-Instruct.IQ4_XS.HARDLINK.gguf . There is quite a high likelihood that it will run out of memory during loading, as even during inference RAM usage was at around 99% on both CastlePeak and StormPeak and imatrix might need slightly more memory. Please avoid scheduling any quantize tasks to nico1 or the Linux kernel of the host will run out of memory, crash and reboot the entire server. After a server reboot caused by a crash nico1 will not be started automatically so don’t be surprised if it is offline when you wake up. I will be sleeping soon so don’t expect me to respond in the next few hours.

FatLlama imatrix computation on IQ4_XS started successfully:

8000 FATLLAMA-1.7T-Instruct                        run/imatrix 314c 505.48s 44.73/2645.32m [1] 665.9745
8000 FATLLAMA-1.7T-Instruct                        run/imatrix 314c 505.48s 53.73/2645.32m [2] 561.3982
8000 FATLLAMA-1.7T-Instruct                        run/imatrix 314c 505.48s 59.73/2645.32m [3] 558.1962

I'm honestly not so sure if this will really survive the entire imatrix process. RAM is so extremely tight that some random process using maybe 100 MiB of RAM will cause the kernel to crash. Luckily Linux is not Windows so there is at least some hope such things won't happen unless Proxmox added something stupid. Please make sure you don’t run anything consuming more than maybe 50 MiB on your LXC container while this imatrix task is running. I created /tmp/pause but that only applies to imatrix tasks and also does not prevent hfd.

Time estimated by llama.cpp is 44 hours which is quite insane but there are also 525 layers which is way more than the 127 of llama 405B so it kind of makes sense.
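The estimate can be reproduced from the status line, assuming "314c" means 314 chunks and "505.48s" means seconds per chunk (my reading of the status output, not something stated explicitly):

```python
chunks = 314                 # "314c" in the status line (assumed: chunk count)
seconds_per_chunk = 505.48   # "505.48s" (assumed: seconds per chunk)

total_minutes = chunks * seconds_per_chunk / 60
total_hours = total_minutes / 60

# Layer counts as stated above: FatLlama vs. Llama 405B.
layer_ratio = 525 / 127

print(f"total: {total_minutes:.2f} min = {total_hours:.1f} h")
print(f"layer ratio: {layer_ratio:.2f}x")
```

The result lands within a fraction of a minute of the 2645.32m shown in the status line, i.e. about 44 hours, for a model with roughly four times as many layers as Llama 405B.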

Oh wow, you documented every step :) Well, I hope I could clarify why spooky unexplainable behaviour :)

100 MiB of RAM will cause the kernel to crash

A bit of swapspace can go a long way, at the risk of, uhm, reduced responsiveness :) But seriously, it's better to allow a bit of paging for dirty pages, because linux will otherwise have to page code pages in and out when memory is tight.

Luckily Linux is not Windows

Indeed, windows seems to be far more stable in recent years under low memory conditions :(:(:( Linux freezing instantly on large allocations is a semi-regular experience for me. On the other hand, the kernel devs also blame nvidia for bad memory management and instability, and I could see a pattern there...

some random process using maybe 100 MiB of RAM

We could test this by rsyncing the models to be imatrixed. That could eat a few dozen MB, especially when run in parallel. :)

Hmm, I closed a few of my ssh logins on nico1 and will risk rsyncing models.

Oh wow, you documented every step :) Well, I hope I could clarify why spooky unexplainable behaviour :)

Thanks a lot for explaining. Everything is clear now.

A bit of swapspace can go a long way, at the risk of, uhm, reduced responsiveness :)

The main reason I don't use swap is because it is kind of a pain when using ZFS. I did set up swap a few weeks ago when I tried and failed creating AWQ quants for Meta-Llama-3.1-405B-Instruct-Uncensored, which require you to load 16-bit into RAM on a single host. The main issue with swap was that it somehow started to get used way before the RAM was full, causing the entire system to get unusably slow. I will give swap another try after the FatLlama imatrix task.

But seriously, it's better to allow a bit of paging for dirty pages, because linux will otherwise have to page code pages in and out when memory is tight.

I learned this the hard way yesterday. I had to reduce the RAM in my OpenWrt VM for enough layers to fit on Threadripper. Turns out 170 MiB of memory was not enough for the services I had running on it. This caused OpenWrt to really start streaming the code it executes from disk using 1 GB/s of SSD read speed over a span of 6 hours. As soon as the imatrix task finally started successfully at like 04:30 I noticed the issue, as with the addition of the RPC load the entire internet slowed down to an unusable speed. Fixing this was not easy at all. I do not have RAM hot plugging enabled for the OpenWrt VM and rebooting it would have caused the imatrix task to break. My only option was to somehow make it use less RAM. But easier said than done, as even SSH was almost unusably slow at this point. After stopping almost all of the not absolutely necessary services I managed to get the RAM usage just barely below the limit without disrupting the imatrix task. Not everyone was happy about services like the SMB shares currently not being available but FatLlama imatrix task is obviously more important than anything else. Services on OpenWrt not running is likely the smallest concern anyways as I shut down every single VM and LXC container except OpenWrt running on my entire cluster to get this imatrix task working at IQ4_XS. Perfect that we are doing this on a weekend as I could not do any development work with everything being down.

Indeed, windows seems to be far more stable in recent years under low memory conditions :(:(:( Linux freezing instantly on large allocations is a semi-regular experience for me. On the other hand, the kernel devs also blame nvidia for bad memory management and instability, and I could see a pattern there...

The entire system crashing if you OOM GPU memory while having GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 enabled seems to almost certainly be an Nvidia issue. They should have designed the GPU drivers in a way where it would kill the running GPU process before crashing the entire system.

Windows has some really nice swap management and surprisingly advanced features regarding memory management. Mainly memory compression and page combining, both of which kick in when memory is getting low, provided those features are enabled. Linux has Kernel Samepage Merging (KSM) as its own page combining implementation but I don't think they offer anything for memory compression. I have to say KSM on Linux is currently what keeps many servers alive at the company I'm working at. They use at least around double the memory of what is physically available, but thanks to KSM it only uses around 80% of physical memory. The reason this works so well is because they are running many clones of the same VM simultaneously, all loading almost the same data into memory. They all have no swap, which might be something I should change if I have good experiences when trying it myself.
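How much KSM currently saves can be read from sysfs. The sketch below reads a few counters from /sys/kernel/mm/ksm and estimates the saving from pages_sharing; the function names are made up, and the default page size of 4096 bytes is an assumption that should be replaced by the real page size where it differs.

```python
import os
from typing import Dict

KSM_DIR = "/sys/kernel/mm/ksm"


def read_ksm_stats(ksm_dir: str = KSM_DIR) -> Dict[str, int]:
    """Read selected KSM counters from sysfs (or a directory mimicking it)."""
    stats = {}
    for name in ("run", "pages_shared", "pages_sharing"):
        with open(os.path.join(ksm_dir, name)) as f:
            stats[name] = int(f.read().strip())
    return stats


def estimated_saving_mib(stats: Dict[str, int], page_size: int = 4096) -> float:
    """pages_sharing counts the deduplicated page mappings, i.e. pages saved."""
    return stats["pages_sharing"] * page_size / 1024**2


try:
    current = read_ksm_stats()
    page = os.sysconf("SC_PAGE_SIZE")
    print(f"KSM run={current['run']}, "
          f"saving ~{estimated_saving_mib(current, page):.1f} MiB")
except (OSError, ValueError):
    print("KSM statistics are not available on this system")
```

A run value of 0 means KSM is not scanning, in which case the saving will stay at zero regardless of how similar the VMs are.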

We could test this by rsyncing the models to be imatrixed. That could eat a few dozen MB, especially when run in parallel. :)

I checked RAM usage on the host while you were downloading models and it seemed to use only very little memory. Don't run them in parallel, as we really shouldn't break it now that things are going so well. We heavily increased the TCP buffer size, so this could become a memory issue as well if we have too many open TCP connections. There is also the OpenWrt router, which is currently in even rougher shape. Single downloads the way you did them are perfectly fine and should (hopefully) not cause any issues.

Fixing this was not easy at all.

Indeed, if rebooting isn't an option, artistry is often required :)

but I don't think they offer anything for memory compression.

You even have the choice between zram and zswap, which do the same thing as windows memory compression. zram is in use in probably most android phones, while zswap is a bit nicer to set up (zram can use an optional dedicated backend device, while zswap can use existing swap. neither needs external swap).

It's very effective if you are low on memory, but not when you have too little memory, in which case things get worse.
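Whether either mechanism is active on a given machine can be checked from sysfs. A minimal sketch with hypothetical helper names, assuming the usual locations /sys/module/zswap/parameters/enabled and /sys/block/zram*:

```python
import os
from typing import List, Optional


def zswap_enabled(
    param_path: str = "/sys/module/zswap/parameters/enabled",
) -> Optional[bool]:
    """Return True/False for zswap, or None if the parameter is unavailable."""
    try:
        with open(param_path) as f:
            return f.read().strip().upper() in ("Y", "1")
    except OSError:
        return None


def zram_devices(block_dir: str = "/sys/block") -> List[str]:
    """List zram block devices (zram0, zram1, ...) if any exist."""
    try:
        return sorted(d for d in os.listdir(block_dir) if d.startswith("zram"))
    except OSError:
        return []


print("zswap enabled:", zswap_enabled())
print("zram devices:", zram_devices())
```

This only reports whether compression is configured, not whether it is helping; that still depends on how tight memory really is.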

You can have thrashing with swapspace enabled, too, of course, in which case it will not only read at 1GB/s, but also write at 1GB/s or so, which is worse :)

Most of my servers usually recover from such a situation, most of my desktops with nvidia not. Could simply be different usage profiles, or me being more patient to wait for a server to recover while I am asleep vs. while I am sitting in front of it, wondering if the dmcache dirtying from a hard reset will slow me down more than waiting for it to recover.

KSM

KSM can be very effective even on servers that don't run VMs, if enabled for any memory.

Windows has some really nice

It got really way better over time. I still hate it.

I checked RAM usage on the host while you were downloading

Yeah, just a few MB. And yes, I limited it to one job. Originally I just wanted to shock you a bit, and then I was... wait... I'll really do that :)

executes from disk using 1 GB/s of SSD

This is certainly what happened, but something is still strange - surely nothing is routed in userspace, i.e. skbufs and kernel code are non-swappable (and in my personal experience, the machine might seem dead and completely unresponsive, but it will still route happily, as long as no userspace decision is needed, which, at 10gbit/s, is certainly not what you would do, per packet).

You must have hit the exact sweet spot for this :)

@RichardErkhov the imatrix Q2_K is incoming, and the rest (including your "requested" IQ1_S) will follow over the next days or next week or so :) From what I gather from what nico said, it might have a few issues with orthography, but is surprisingly intelligent. Guess you need to finetune it now :-)

You can watch it at http://hf.tst.eu/status.html if I haven't mentioned that yet.

@RichardErkhov there is a lovely IQ1_M available now, at the low cost of 390GB: https://huggingface.co./mradermacher/FATLLAMA-1.7T-Instruct-i1-GGUF

That's what you wanted, right? :)

@mradermacher I want IQ-1, it is -50GB model size, it owes me 50GB, try to do that please, let me know when it comes out

I can give you an empty file by quantizing everything down to zero bits, but I doubt it will satisfy. Guess we need to force @nicoboss to test the IQ1_M :)

Give 0.8 bit please

I first hardlinked IQ1_M, only to then realize that IQ2_XXS with 417 GiB easily fits as well, so I deleted IQ1_M and hardlinked that one instead. Currently IQ2_XS is getting computed, and I have the feeling it could fit too based on my calculations, but we will see. In any case now is probably a great time to try it as all imatrix tasks and eval tasks are done for today.

@nicoboss nah, the task was to test the iq1_m (and iq1_s) specifically. so you failed :-)
