THE THREAD OF DOOM
Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(
Okay, I was wondering if we crossed some sort of line.
Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...
@ChuckMcSneed @BigHuggyD @gghfez
Ping.
Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.
I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.
Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.
It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard-limit (I still think Dark-Miqu-70B is worth keeping whatever though).
Also if anybody really misses any I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice that most of the models have gone... Of all that I have created I've only ever used Dark-Miqu-70B myself!
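For anyone wondering what "a LoRA created from the delta" means in practice, here is a minimal sketch of the general idea, assuming the usual truncated-SVD approach. This is not the actual extract_lora.py; the function name, the shapes, and the random stand-in weights are all hypothetical, and a real script would loop over every weight matrix in the checkpoint.

```python
import numpy as np

def extract_lora(w_base, w_tuned, rank):
    """Return (lora_b, lora_a) so that w_base + lora_b @ lora_a approximates w_tuned."""
    # The delta is what fine-tuning or merging added on top of the base weights.
    delta = w_tuned - w_base
    # Truncated SVD gives the best rank-`rank` approximation of the delta.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    lora_b = u[:, :rank] * sqrt_s          # shape (out_features, rank)
    lora_a = sqrt_s[:, None] * vt[:rank]   # shape (rank, in_features)
    return lora_b, lora_a

# Toy stand-in weights (hypothetical): the delta is built to be exactly rank 4.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
true_delta = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
w_tuned = w_base + true_delta

lora_b, lora_a = extract_lora(w_base, w_tuned, rank=4)
rel_err = np.linalg.norm(w_base + lora_b @ lora_a - w_tuned) / np.linalg.norm(true_delta)
print(f"relative reconstruction error: {rel_err:.2e}")
```

With a real model the delta is rarely exactly low rank, so the chosen rank trades adapter size against how closely the recreated model matches the original.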
:( Damn there was some good info in that thread.
If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.
Unfortunately, I cleaned my browser tabs up about an hour ago.
And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.
I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.
@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.
I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol
I should probably deduplicate my quants. Often, I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests. So I just decided I would make a bunch. Need a Huggingverse quant global dedupe...
./mistralrs-server -i --isq Q4K plain -m /path/to/DeepSeek-R1:
Error: unknown variant `sigmoid`, expected `softmax` at line 60 column 27
Nice DeepSeek support. Complains about https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/config.json. Did the dev at least try running it before claiming his program supports it?
That's shit :/
What happens if you edit the config to "softmax"? If it runs and is a bit broken then maybe worth asking them about it, but if another retarded bug/oversight then I'd just give up as they obviously haven't even tried it.
Wow, a lot happened in January eh?
I never had any luck getting mistralrs-server running a few months ago when it was one of the first models to support one of the vlm's.
If you guys are interested in speed/benchmarks, I just tested the Q2_K_L in 128GB of RAM with SSD offload on my DDR5 threadripper:
llama_perf_sampler_print: sampling time = 91.77 ms / 1828 runs ( 0.05 ms per token, 19919.80 tokens per second)
llama_perf_context_print: load time = 74543.65 ms
llama_perf_context_print: prompt eval time = 46243.71 ms / 130 tokens ( 355.72 ms per token, 2.81 tokens per second)
llama_perf_context_print: eval time = 898481.28 ms / 1697 runs ( 529.45 ms per token, 1.89 tokens per second)
llama_perf_context_print: total time = 945477.50 ms / 1827 tokens
Surprisingly... very coherent for general assistant / reasoning tasks I'd normally use o1 for.
And here's an AWS spot instance with 768GB of RAM, US$2.20 / hour, running the Q4_K_M:
llama_perf_sampler_print: sampling time = 857.38 ms / 1804 runs ( 0.48 ms per token, 2104.08 tokens per second)
llama_perf_context_print: load time = 32766.10 ms
llama_perf_context_print: prompt eval time = 4952.24 ms / 130 tokens ( 38.09 ms per token, 26.25 tokens per second)
llama_perf_context_print: eval time = 262606.15 ms / 1673 runs ( 156.97 ms per token, 6.37 tokens per second)
llama_perf_context_print: total time = 269223.00 ms / 1803 tokens
300GB ram free, so next time I'll rent a 512gb instance.
@BigHuggyD seems like 6 t/s on CPU @ $2.20 / hour is better value than the H200's (unless they're provided to you for free lol) ?
P.S. The Qwen-32b distill is useless for creative writing. Qwen just hasn't been trained on fiction imo.
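To put a number on the value question above, here is a rough sketch that turns an hourly price and a generation speed into a cost per million generated tokens. It uses the $2.20 / hour and 6.37 tokens per second figures from the AWS run; the function name is made up for illustration, and it ignores load time, prompt processing, and idle time, so treat the result as a lower bound.

```python
def usd_per_million_tokens(usd_per_hour, tokens_per_second):
    """Cost of generating one million tokens at a steady generation rate."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# Figures from the AWS spot instance run above (Q4_K_M, eval rate 6.37 t/s).
aws_cpu = usd_per_million_tokens(2.20, 6.37)
print(f"AWS spot, Q4_K_M on CPU: ${aws_cpu:.2f} per million generated tokens")
```

The same function can be fed an H200 hourly price and measured speed for a like-for-like comparison; neither number was posted in this thread, so none is filled in here.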
Yeah, I think the fewer people that (can) run a model, the less likely it is to get fixed sadly :/
Apparently you can run it locally with 64gb of RAM on CPU (slowly like I did above).
Kind of wish I'd gone with EPYC instead of TR now. My motherboard only has 4 RAM slots, and the 4x32gb of RAM cost a fortune already.
@BigHuggyD seems like 6 t/s on CPU @ $2.20 / hour is better value than the H200's (unless they're provided to you for free lol) ?
No freaking kidding 😂 it was FP8 vs Q4 but I wonder how much is lost. Definitely not something I would pay for out of pocket to run on a buncha H200s. My time with them comes to an end this month but they have extended me twice already so we'll see.
https://github.com/ggerganov/llama.cpp/pull/11049
It might be worth asking in the original PR above or even making a new issue to see if they can add the "Multi-head Latent Attention" code.
It looks like llama.cpp is going to be the only viable backend that will let you run deepseek v2/v3/r1 on CPUs (the Huggingface transformers implementation seems to be completely stalled, KTransformers looks completely abandoned now, etc).
There is an experimental PR by the person who wrote the llama.cpp code for deepseek here:
https://github.com/fairydreaming/llama.cpp/tree/deepseek2-mla-exp
No idea if it works though.
What happens if you edit the config to "softmax"? If it runs and is a bit broken then maybe worth asking them about it, but if another retarded bug/oversight then I'd just give up as they obviously haven't even tried it.
Error: unknown variant `noaux_tc`, expected `greedy` or `group_limited_greedy` at line 64 column 27
After changing that it goes on to tell me that I don't have enough RAM, says that I need 2TB for some reason.
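For reference, the two edits described above could be scripted roughly like this. It is only a sketch: the key names `scoring_func` and `topk_method` are my assumption about which config.json fields hold the rejected values, `greedy` is just one of the two values the error message lists as accepted, and the example dict is a stand-in rather than the real config. Swapping these values changes how the experts are routed, so even if the model loads, the output is probably not what DeepSeek intended.

```python
import copy
import json

# Assumed mapping of config key -> (value the loader rejects, value it accepts).
SUBSTITUTIONS = {
    "scoring_func": ("sigmoid", "softmax"),
    "topk_method": ("noaux_tc", "greedy"),
}

def patch_config(config):
    """Return a patched copy of `config` plus the list of keys that were changed."""
    patched = copy.deepcopy(config)
    changed = []
    for key, (old, new) in SUBSTITUTIONS.items():
        if patched.get(key) == old:
            patched[key] = new
            changed.append(key)
    return patched, changed

# Stand-in for the relevant part of config.json (hypothetical values).
example = {"scoring_func": "sigmoid", "topk_method": "noaux_tc", "n_routed_experts": 256}
patched, changed = patch_config(example)
print(json.dumps(patched, indent=2))
print("changed keys:", changed)
```

Working on a copy keeps the original config around, which makes it easy to revert once the backend supports the real values.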
Obviously he never tested any of it then lol
(Sorry for even suggesting this)
There is an experimental PR by the person who wrote the llama.cpp code for deepseek here:
https://github.com/fairydreaming/llama.cpp/tree/deepseek2-mla-exp
No idea if it works though.
I would hold off and keep an eye on this thread:
https://github.com/ggerganov/llama.cpp/pull/11049#issuecomment-2612910496
It looks like there is a much more efficient way of calculating MLA that deepseek mentioned in their original (v2) paper, but for some reason never implemented in their own reference python code!
If fairydreaming can get it working then this might open up an alternative to KV-cache quantisation for all models (see my post further down).
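My reading of the "more efficient way" is the weight-absorption trick: instead of decompressing every cached latent back into a full key, fold the key up-projection into the query once and score directly against the latents. Here is a toy sketch of that identity in plain Python, assuming a single head and ignoring the RoPE part of the key, the softmax, and the scaling; all names and sizes are made up for illustration.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d_head, d_latent, n_tokens = 8, 4, 5  # toy sizes (hypothetical)

# Key up-projection: maps a cached latent (d_latent) to a full key (d_head).
w_uk = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
# The cache holds one small latent per past token instead of a full key.
latents = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(d_head)]

# Naive: decompress every cached latent to a full key, then take q . k.
naive = [dot(query, matvec(w_uk, c)) for c in latents]

# Absorbed: project the query into latent space once, then score against latents.
q_absorbed = matvec(transpose(w_uk), query)
absorbed = [dot(q_absorbed, c) for c in latents]

max_diff = max(abs(a - b) for a, b in zip(naive, absorbed))
print(f"max difference between the two score vectors: {max_diff:.2e}")
```

The two paths give the same scores up to floating-point rounding, but the absorbed one does a single projection per query instead of one decompression per cached token, which is where the saving on long contexts would come from.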