danielhanchen committed · verified
Commit 9123ce8 · Parent(s): 8831896

Update README.md

Files changed (1):
  1. README.md +16 -12
README.md CHANGED
@@ -17,13 +17,15 @@ tags:
 ### Instructions to run this model in llama.cpp:
 Or you can view more detailed instructions here: [unsloth.ai/blog/deepseek-r1](https://unsloth.ai/blog/deepseek-r1)
 1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
-2. Example with Q5_0 K quantized cache (V quantized cache doesn't work):
+2. Example with K & V quantized cache. **Note: `-no-cnv` disables auto conversation mode.**
 ```bash
-./llama.cpp/llama-cli
---model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS/DeepSeek-R1-Distill-Llama-8B-Q2_K_XS-00001-of-00005.gguf
---cache-type-k q5_0
---threads 16
---prompt '<|User|>What is 1+1?<|Assistant|>'
+./llama.cpp/llama-cli \
+--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
+--cache-type-k q8_0 \
+--cache-type-v q8_0 \
+--threads 16 \
+--prompt '<|User|>What is 1+1?<|Assistant|>' \
+-no-cnv
 ```
 Example output:
 ```txt
@@ -35,13 +37,15 @@ Or you can view more detailed instructions here: [unsloth.ai/blog/deepseek-r1](h
 
 So, **1 + 1 = 2**. [end of text]
 ```
-3. If you have a GPU (RTX 4090 for example) with 24GB, you can offload 5 layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
+3. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
 ```bash
-/llama.cpp/llama-cli \
---model DeepSeek-R1-Distill-Llama-8B-F16.gguf\
---cache-type-k q8_0 \
---prompt '<|User|>What is 1+1?<|Assistant|>' \
---threads 32 \
+./llama.cpp/llama-cli \
+--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
+--cache-type-k q8_0 \
+--cache-type-v q8_0 \
+--threads 16 \
+--prompt '<|User|>What is 1+1?<|Assistant|>' \
+--n-gpu-layers 20 \
 -no-cnv
 ```
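A few notes on the updated commands follow. Step 1 offers a chat template formatter as the alternative to typing the tokens by hand. A minimal sketch, assuming a recent llama.cpp build where conversation mode is the default (that is what the new `-no-cnv` note switches off): run without `-no-cnv` and llama-cli applies the `<|User|>`/`<|Assistant|>` template embedded in the GGUF for you.

```bash
# Sketch, assuming a recent llama.cpp build: interactive chat mode.
# -cnv is the explicit form of the default conversation mode, which
# wraps each turn in the chat template stored inside the GGUF.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
--threads 16 \
-cnv
```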
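On step 2's cache flags: the removed line said a quantized V cache "doesn't work", and in llama.cpp that is usually because quantizing the V cache requires flash attention. If the new command aborts on `--cache-type-v q8_0`, a hedged variant with flash attention enabled:

```bash
# Sketch: -fa (--flash-attn) turns on flash attention, which llama.cpp
# typically requires before it will quantize the V cache.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
-no-cnv
```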
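Both commands point `--model` at a local unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF directory, so the Q2_K_L file has to be downloaded first. One way to do that, sketched with the standard huggingface_hub CLI; the `--local-dir` value here is chosen only to match the relative path used above:

```bash
# Sketch: fetch just the Q2_K_L GGUF from the Hugging Face repo into a
# directory matching the --model path used in the commands above.
pip install huggingface_hub
huggingface-cli download unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF \
  DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
  --local-dir unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
```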
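Step 3's `--n-gpu-layers 20` is a starting point rather than a tuned value. llama.cpp clamps the flag to the model's actual layer count, so an intentionally oversized value offloads every layer that exists; a sketch for probing what a 24GB card will take:

```bash
# Sketch: request more layers than the model has (llama.cpp clamps to
# the real count), then check VRAM use to confirm everything fit.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q2_K_L.gguf \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
--n-gpu-layers 99 \
-no-cnv
nvidia-smi   # check VRAM headroom; lower --n-gpu-layers if the load fails
```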