https://huggingface.co./deepseek-ai/DeepSeek-V3-Base
Best and largest public base model ever: 685B parameters, and it beats Claude 3.5 Sonnet.
No idea if it is supported by llama.cpp, but it needs to be handled manually due to its size anyway.
I just started downloading the model. I will let you know once I know if it is llama.cpp compatible.
As expected, it is not yet supported by llama.cpp:
INFO:hf-to-gguf:Loading model: DeepSeek-V3-Base
ERROR:hf-to-gguf:Model DeepseekV3ForCausalLM is not supported
I will archive it to hpool and then we can do it as soon as llama.cpp implements support for it.
685B! Well, it will probably be supported soon.
lmao what if I frankenmerge it like fatllama? how do you even run such a model... I wish I had my 1.5TB ram server ;(
lmao what if I frankenmerge it like fatllama?
Why would you do that, richard, that is so totally out of character for you.
There is now a pull request that adds llama.cpp support for DeepseekV3 680B: https://github.com/ggerganov/llama.cpp/pull/11049
GGUF conversion is a bit of a pain as one first has to upscale the model to BF16, but other than that it is quite straightforward.
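Roughly, the conversion steps look like this; the paths are placeholders and I'm going from memory regarding DeepSeek's FP8-to-BF16 cast script, so the script location and argument names may differ:

# 1. Upscale the released FP8 checkpoint to BF16 (roughly 1.4 TB at 2 bytes per parameter).
#    The cast script ships in the inference/ folder of the deepseek-ai/DeepSeek-V3 repo;
#    argument names below are from memory.
python DeepSeek-V3/inference/fp8_cast_bf16.py \
    --input-fp8-hf-path /bpool/DeepSeek-V3-Base \
    --output-bf16-hf-path /bpool/DeepSeek-V3-Base-BF16
# (copy config.json and the tokenizer files into the BF16 directory if the script does not do it)

# 2. Convert the BF16 checkpoint into the source GGUF with the PR's convert script.
python llama.cpp/convert_hf_to_gguf.py /bpool/DeepSeek-V3-Base-BF16 \
    --outtype bf16 --outfile /bpool/DeepSeek-V3-Base.SOURCE.gguf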
The PR looks great and ready to be merged so we can hopefully start soon.
I'm extremely excited about this model and am already thinking of creating an uncensored version but it would be even more challenging and more expensive than for 405B.
The PR looks great and ready to be merged so we can hopefully start soon.
Is the plan to do both base and instruct? For snowflake-arctic only the instruct was done. I'm not asking for snowflake-arctic base as I haven't even touched the instruct yet, and with Deepseek on the horizon and snowflake only having a native 4k context window, I don't see myself being interested in arctic much anyway.
I'm extremely excited about this model and am already thinking of creating an uncensored version but it would be even more challenging and more expensive than for 405B.
I know they use a guard on their API, and I've heard a lot of mixed reports on how censored Deepseek is, so I'm curious how it actually is. I know you said arctic is fully uncensored, and would be lighter to run as well.
Is the plan to do both base and instruct?
Both of them for sure. This is probably the best model ever openly released. I already downloaded both of them. I'm just waiting for the PR adding llama.cpp support for it to reach a state where I'm confident no more breaking changes will be made. The current state already looks very promising and the tokenizer has now reached parity with the original implementation, but let's wait for ggerganov to at least finish his review and approve the latest changes, or ideally merge them. They merged https://github.com/ggerganov/llama.cpp/pull/10902 today, which is a massive change to the entire llama.cpp codebase, but luckily it doesn't seem to impact the Deepseek v3 PR that much.
Just as a heads-up: be ready for an uncensored 405B Hermes and likely even a Samantha variant of it. I'm finetuning it right now, so there are quite a lot of big models to come. I ordered 4x 18 TB of enterprise-grade HDD storage which I will put in RAIDZ1, resulting in 54 TB of usable HDD storage (single parity, so 3 x 18 TB of the 4 disks is usable), meaning at least temporary storage for such massive models should soon no longer be an issue. Regarding SSD storage, I will start pausing nico1 during nighttime to finally complete the performance measurement project, starting today, the 4th of January, at 17:00, so I can eventually free up the 8 TB of SSD storage currently occupied by Qwen 2.5 models.
For snowflake-arctic only the instruct was done.
The snowflake-arctic base model will be done as soon as I have time to look into a better way to quant despite the lack of data for that specific expert. Or maybe we should just proceed. It really is quite an awesome model in terms of being fully uncensored and trained with very different data compared to any other model. Not because of its writing style but because of its knowledge: it gives quite different answers compared to almost every other model and is perfect to use as a second opinion for a given question. In case you wonder, Guilherme34 describes its writing style as "robotic".
I know they use a guard on their API, and I've heard a lot of mixed reports on how censored Deepseek is, so I'm curious how it actually is.
I'm very curious as well. If it is censored I will for sure try my best to make it uncensored. Deepseek v3 is probably better than Llama 405B while being as fast to run as a 37B model, since it only activates around 37B of its parameters per token. I'm really impressed by it so far.
The snowflake-arctic base model will be done as soon as I have time to look into a better way to quant despite the lack of data for that specific expert.
I completely forgot. We should do it asap, who knows if we ever get around to the "proper" fix. I am still suggesting just forcing it to write out the measurement data, partial or not, and see how that works out. I also don't see an issue doing it from Q4_K_M or so, either :-)
I am still suggesting just forcing it to write out the measurement data, partial or not, and see how that works out.
I really like this idea and it would be cool to try. Do you have any idea how to implement it? I'm not that familiar with that part of llama.cpp but I can look into it.
You are as familiar as me, or more, because my familiarity is zero. I'll have a look, though, as soon as I can find time. I suspect all we have to do is skip an if condition near the message we got. If llama.cpp really somehow stores this data in a sparse format and it doesn't exist at all, too bad. But if it tells us it has, say, 91% of a tensor, I think there is a good chance that the code that normally writes out the results can just write out partial tensors.
I'd say it's as simple as this:
diff --git a/examples/imatrix/imatrix.cpp b/examples/imatrix/imatrix.cpp
index 45206f4a..9ad9383d 100644
--- a/examples/imatrix/imatrix.cpp
+++ b/examples/imatrix/imatrix.cpp
@@ -248,7 +248,7 @@ void IMatrixCollector::save_imatrix(int ncall) const {
         if (n_zeros > 0) {
             LOG_WRN("%s: entry '%40s' has partial data (%.2f%%) - skipping\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
-            continue;
+            //continue;
         }

         n_entries++;
If you want to play around (I would appreciate that :), the way to go, I think, would be to generate some imatrix data from a low-bit static quant, then requantize the model with that imatrix data and see if it still "works" (with IQ1_S or at most IQ2_*). I suspect that if everything runs without crashing, we are no worse off than before, but very likely better. I suspect the highest likelihood of something going wrong is quantize crashing because it can't find a solution for the "zeroed" values, in which case something more complicated would be needed, such as replacing the zeroes with something else. But I think the zeroes are just counts, not actual calibration values.
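Concretely, the test loop I have in mind is roughly this; file names, the calibration text and the quant types are just placeholders:

# 1. collect imatrix data from a cheap static quant - with the patch above,
#    partially-covered entries get written out instead of being skipped
./llama-imatrix -m model.Q4_K_M.gguf -f calibration.txt -o imatrix-partial.dat

# 2. requantize the source GGUF with that (possibly partial) imatrix at a very low bitrate
./llama-quantize --imatrix imatrix-partial.dat model.SOURCE.gguf model.IQ1_S.gguf IQ1_S

# 3. see whether the result still produces coherent output
./llama-cli -m model.IQ1_S.gguf -p "Hello" -n 64

If quantize aborts on the zeroed entries, that's the point where we'd need the more complicated fix.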
https://github.com/ggerganov/llama.cpp/pull/11049 is now merged! I will convert the base models to GGUFs as soon as llama-bench for Qwen2.5-72B/Qwen2.5-72B-Instruct is done, or first thing in the morning. Before we can do imatrix quants I will need to move the RTX 3080 back into StormPeak, requiring scheduled hardware maintenance - I will likely do that once the source GGUFs are done. We will also need the RPC setup for Hermes-3-Llama-3.1-405B-Uncensored and the soon-to-be-released Hermes-3-Llama-3.1-405B-Samantha I'm finetuning right now.
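For the RPC setup, the rough shape is: run rpc-server on every node that contributes RAM/VRAM and point the main llama.cpp instance at them. Hostnames, ports and file names below are placeholders:

# on each worker node
./rpc-server -p 50052

# on the main node, e.g. while computing the imatrix: spread the model across the RPC workers
./llama-imatrix -m Hermes-3-Llama-3.1-405B-Uncensored.gguf -f calibration.txt \
    -o imatrix.dat --rpc 10.0.0.1:50052,10.0.0.2:50052 -ngl 99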
I'd say it's as simple as this
Thanks a lot for the diff. I will for sure give it a try once we are done with the other massive models.
Thanks a lot for the diff. I will for sure give it a try once we are done with the other massive models.
Well, it's quite low effort and would cost you a minute to grep. The point was more that just forcing it to write what it would normally write should be a first try. And playing around with it nearby should get us somewhere. If it works, I might even consider doing this by default, as it is not an uncommon problem.
@mradermacher DeepSeek-V3 is ready for static quants!
I created the following softlinks:
/tmp/DeepSeek-V3.gguf -> /bpool/DeepSeek-V3.SOURCE.gguf
/tmp/quant/DeepSeek-V3.gguf -> /bpool/DeepSeek-V3.SOURCE.gguf
Feel free to ignore all time-of-day limitations for this one and make sure to use the latest llama.cpp.
it's quanting
we could try static IQ3 and see how that works out? I'm close to enabling it by default...
we could try static IQ3 and see how that works out? I'm close to enabling it by default...
Sounds like a great idea for such massive models. Static IQ3 seems like a gamble: sometimes it turns out well and sometimes it sucks, but the larger the model, the more likely it should be to turn out well. No idea about MoE models like this one, so it would be interesting to try as it gives us another datapoint. Another interesting datapoint would be to requant some of the models where static IQ3 was terrible using the latest llama.cpp and see if they are still terrible.
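For the record, a static IQ3 here is simply an IQ3 quant done without an importance matrix, e.g. (quant type and file names only as examples):

# static IQ3 quant - no imatrix involved
./llama-quantize /tmp/quant/DeepSeek-V3.gguf DeepSeek-V3.IQ3_S.gguf IQ3_S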
@mradermacher DeepSeek-V3-Base is ready for static quants!
I created the following softlinks:
/tmp/DeepSeek-V3-Base.gguf -> /bpool/DeepSeek-V3-Base.SOURCE.gguf
/tmp/quant/DeepSeek-V3-Base.gguf -> /bpool/DeepSeek-V3-Base.SOURCE.gguf
Feel free to ignore all time-of-day limitations for this one as well. No idea if we can quant two such massive models at once with the 4 TB of storage we have.
Please prioritize DeepSeek-V3-Base higher than Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha as many seem to be waiting for it.
queued