Hardware requirements?

#19

by JohnnieB - opened Jan 21

Discussion

JohnnieB

Jan 21

Hey guys, what hardware's required to run this model as is?

shl0th

Jan 22

@JohnnieB what do you mean by this? I was able to run this locally on a 3080

shl0th

Jan 22

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager

but also primeintellect has really cheap hardware

JohnnieB

Jan 22

I meant what hardware does the full Deepseek-R1 need to run, not the Distilled versions. It's good to know Distill-Qwen-32B can run locally on a 3080 tho.

RageshAntony

Jan 22

@shl0th
That is not DeepSeek R1, it's Qwen-32b. OP asked the original 671B R1?

Hollway

Jan 22

According to Distilled Model Evaluation, DeepSeek-R1-Distill-Qwen-32B is very close to o1-mini, must be a balanced option

vmajor

Jan 23

I meant what hardware does the full Deepseek-R1 need to run, not the Distilled versions. It's good to know Distill-Qwen-32B can run locally on a 3080 tho.

I am running the Q8_0 GGUF with llama.cpp on my 256 GB workstation. It does not actually require this much RAM since it is an MoE model, if you keep the context window modest. The KV cache consumes more RAM than the model itself. For example with context length of 32092 tokens it takes around 220 GB RAM.

RageshAntony

Jan 23

@vmajor

So, you able to run original R1 671B model with Q8_0 GGUF in 256 GB VRAM?if yes, which GPUs config you are using ?

vmajor

Jan 23

Yes, it is running now. I am exploring the odd regression issues. I am not using any GPUs, this is CPU only so the speed is not all that great (it is very slow, more like emailing than chat) but it works and can work for agent work, if I can trust it enough not to give me boilerplate nonsense on random challenges.

RageshAntony

Jan 23

@vmajor
ooh, in CPU.. that's painful for a 600B model... Which CPU you are using ?

vmajor

Jan 23

It is a MoE so (from memory) only 40B parameters are active at one time. It is far from real time, but for turn based or agent work it is fine. Dual Xeon something. Old tech, but available cheaply. When/if this space settles on some kind of 'optimum' I will look at purchasing faster hardware.

RageshAntony

Jan 23

•

edited Jan 23

@vmajor

sounds good. Which repo you sourced the Q8_0 GGUF from ?

vmajor

Jan 23

unsloth

RageshAntony

Jan 23

@vmajor
ooh. one last question.. do you have any idea of running LLM models in AMD MI300X using Oolama ?

vmajor

Jan 23

Lol no :) I do not. Also I never quite got into the ollama train. I prefer to download my own models and not have to mess around with config files ie. ollama does not simplify things for me.

likewendy

Jan 23

Why not use an API? Like using the openai o1 API.

Even if you are worried about data, there are many third-party API vendors, such as openroute

RageshAntony

Jan 23

•

edited Jan 23

@likewendy

in openrouter R1 API, do you able to get "< think > ...< / think >" part also? I only get the final result when using deepseek as provider

alain401

Jan 24

•

edited Jan 24

Running it at about 5-8 t/s on a dual EPYC CPU with 24 x 16GB of DDR5 RAM (384GB). Running the IQ4_XS version with llama.cpp. No GPU.

Total system cost a bit over $4000, CPU bought on eBay engineering samples.

alain401

Jan 24

Also using KV cache compression q4_0

kth8

Jan 26

Running it at about 5-8 t/s on a dual EPYC CPU with 24 x 16GB of DDR5 RAM (384GB). Running the IQ4_XS version with llama.cpp. No GPU.

Total system cost a bit over $4000, CPU bought on eBay engineering samples.

What motherboard are you using? The number of RAM channels make a big difference for CPU inference. I've been looking at a motherboard that support 12-channel DDR5-6000 memory at 576 GB/s https://www.phoronix.com/review/8-12-channel-epyc-9005