shimmyshimmer committed
Commit 8831896 · verified · 1 Parent(s): f20542b

Update README.md

Files changed (1)
1. README.md +4 -5
README.md CHANGED
@@ -16,12 +16,11 @@ tags:
 
 ### Instructions to run this model in llama.cpp:
 Or you can view more detailed instructions here: [unsloth.ai/blog/deepseek-r1](https://unsloth.ai/blog/deepseek-r1)
- 1. Use K quantization (not V quantization)
- 2. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
- 3. Example with Q5_0 K quantized cache (V quantized cache doesn't work):
+ 1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
+ 2. Example with Q5_0 K quantized cache (V quantized cache doesn't work):
 ```bash
 ./llama.cpp/llama-cli
- --model unsloth/DeepSeek-V3-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS-00001-of-00005.gguf
+ --model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS-00001-of-00005.gguf
 --cache-type-k q5_0
 --threads 16
 --prompt '<|User|>What is 1+1?<|Assistant|>'
@@ -36,7 +35,7 @@ Or you can view more detailed instructions here: [unsloth.ai/blog/deepseek-r1](h
 
 So, **1 + 1 = 2**. [end of text]
 ```
- 4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload 5 layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
+ 3. If you have a GPU (RTX 4090 for example) with 24GB, you can offload 5 layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
 ```bash
 /llama.cpp/llama-cli \
 --model DeepSeek-R1-Distill-Llama-8B-F16.gguf\
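The corrected `--model` path in this commit points at the first of five Q2_K_XS shards inside the unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF repo; llama.cpp is given the first shard and loads the rest from the same folder. A minimal sketch of fetching just that folder with the Hugging Face CLI, not part of the committed README; the `--include` pattern and `--local-dir` target are illustrative assumptions built from the path above:

```bash
# Sketch: download only the Q2_K_XS shards referenced by the --model path in the diff.
# Requires the huggingface_hub package (`pip install huggingface_hub`).
huggingface-cli download unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF \
  --include "DeepSeek-R1-Distill-Llama-8B-Q2_K_XS/*" \
  --local-dir unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
```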
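For the renumbered step 3 (GPU offload), llama.cpp controls how many layers go to the GPU with `--n-gpu-layers` (alias `-ngl`). A hedged sketch, assuming a CUDA-enabled build and reusing the Q2_K_XS shard and cache settings from the step 2 example rather than the F16 file the README's own snippet uses; the value 5 mirrors the "offload 5 layers" suggestion in the text:

```bash
# Sketch: same invocation as the Q5_0 K-cache example, plus GPU offload.
# --n-gpu-layers 5 offloads 5 layers; raise it with more VRAM or multiple GPUs.
./llama.cpp/llama-cli \
  --model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS-00001-of-00005.gguf \
  --cache-type-k q5_0 \
  --threads 16 \
  --n-gpu-layers 5 \
  --prompt '<|User|>What is 1+1?<|Assistant|>'
```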