Update README.md with model information and quantization details
README.md
ADDED
@@ -0,0 +1,114 @@
---
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
tags:
- deepseek
- llama.cpp
library_name: transformers
pipeline_tag: text-generation
quantized_by: hdnh2006
---

# DeepSeek-R1-Distill-Qwen-1.5B GGUF llama.cpp quantization by [Henry Navarro](https://henrynavarro.org) 🧠🤖

This repository contains GGUF format model files for DeepSeek-R1-Distill-Qwen-1.5B, quantized using [llama.cpp](https://github.com/ggerganov/llama.cpp).

All models in this repository have been quantized following the [instructions](https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md#quantize) provided by llama.cpp, that is:
```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
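
The block above is the generic llama.cpp recipe. Applied to this particular model, the steps would look roughly like the sketch below (an illustration only, assuming a freshly built llama.cpp checkout and illustrative local paths; it is not the exact command history used to produce the files in this repository):

```bash
# build llama.cpp (CMake build; binaries end up in ./build/bin)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# download the original model into an assumed local folder
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./models/DeepSeek-R1-Distill-Qwen-1.5B

# convert to GGUF F16, then quantize to Q4_K_M (any other type from the table below works the same way)
python3 convert_hf_to_gguf.py ./models/DeepSeek-R1-Distill-Qwen-1.5B --outtype f16 --outfile ./models/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
./build/bin/llama-quantize ./models/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf ./models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
```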

## Model Details

Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

## Models summary 📋
| Filename | Quant type | Description |
| -------- | ---------- | ----------- |
| [DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf) | F16 | Half precision, no quantization applied, largest file |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf) | Q8_0 | 8-bit quantization, highest quality of the quantized files, largest quantized size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf) | Q6_K | 6-bit K-quant, very high quality, close to Q8_0 |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_1.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_1.gguf) | Q5_1 | 5-bit legacy quantization, high quality, slightly larger than Q5_K_M |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf) | Q5_K_M | 5-bit K-quant, good balance of quality and size (recommended) |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf) | Q5_K_S | 5-bit K-quant, slightly smaller than Q5_K_M with minor quality loss |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_0.gguf) | Q5_0 | 5-bit legacy quantization, high quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf) | Q4_1 | 4-bit legacy quantization, balanced quality and size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf) | Q4_K_M | 4-bit K-quant, balanced quality and size (recommended default) |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf) | Q4_K_S | 4-bit K-quant, slightly smaller than Q4_K_M with minor quality loss |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf) | Q4_0 | 4-bit legacy quantization, balanced quality and size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf) | Q3_K_L | 3-bit K-quant, largest of the 3-bit variants, lower quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf) | Q3_K_M | 3-bit K-quant, smaller size, lower quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf) | Q3_K_S | 3-bit K-quant, smallest 3-bit variant, lowest 3-bit quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf) | Q2_K | 2-bit quantization, smallest size, lowest quality |

## Usage with Ollama 🦙

### Direct from Ollama
```bash
ollama run hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B
```
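
If you would rather point Ollama at a locally downloaded GGUF file, a minimal Modelfile can wrap it. This is a sketch only: the file name assumes the Q4_K_M quant sits in the current directory, and the tag `deepseek-r1-1.5b-local` is an arbitrary local name.

```bash
# write a minimal Modelfile that references the local GGUF file (assumed path)
cat > Modelfile <<'EOF'
FROM ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
EOF

# register the model under a local tag, then chat with it
ollama create deepseek-r1-1.5b-local -f Modelfile
ollama run deepseek-r1-1.5b-local
```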

## Download Models Using huggingface-cli 🤗

### Installation of `huggingface_hub[cli]`
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
```bash
huggingface-cli download hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B --include "DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf" --local-dir ./
```
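
Once a file is downloaded, it can also be run directly with llama.cpp. The sketch below assumes the `llama-cli` and `llama-server` binaries are already built and on your PATH, and `-ngl 99` (offload all layers to the GPU) is just an illustrative setting; drop it or lower it on CPU-only machines.

```bash
# interactive chat with the downloaded quant
llama-cli -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -ngl 99 -cnv

# or serve an OpenAI-compatible HTTP endpoint on port 8080
llama-server -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -ngl 99 --port 8080
```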

## Which File Should I Choose? 📈

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system (the sketch after this list shows a quick way to check both). This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
   - **GPU Utilization**: To run the model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version whose file size is 1-2GB smaller than your total VRAM.
3. **Maximizing Quality**:
   - **Combined Memory**: For the highest possible quality, sum your system RAM and your GPU's VRAM, then choose a model file that is 1-2GB smaller than this combined total.
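
As a quick way to get those numbers on a Linux machine, the sketch below uses standard utilities (`free` for system RAM, `nvidia-smi` for NVIDIA VRAM; AMD and Apple users will need their platform's equivalents) and compares them against the size of a downloaded file:

```bash
# total system RAM
free -h | awk '/^Mem:/ {print "System RAM: " $2}'

# total VRAM per NVIDIA GPU (skip on machines without an NVIDIA card)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# size of a downloaded GGUF file, to compare against the numbers above
ls -lh ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
```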

### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
   - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
   - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
   - **I-Quant Models**: Best suited for quantization levels below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better quality for their size. (Note that the table above only lists K-quant and legacy formats; no I-quant files are provided in this repository.)
   - **Compatibility Considerations**:
     - **I-Quant Models**: While usable on CPU and Apple Metal, they run slower than their K-quant counterparts, so the choice becomes a tradeoff between speed and quality at a given size.
     - **AMD Cards**: Check whether you are using the rocBLAS build or the Vulkan build; I-quants are not compatible with Vulkan.
     - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.
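
If you are unsure which quantization a file you already have actually uses, its GGUF metadata records the file type. A sketch, assuming the `gguf` Python package (published alongside llama.cpp) is installed and the Q4_K_M file is in the current directory:

```bash
# install the gguf helper package and dump the file's metadata, which includes the quantization type
pip install gguf
gguf-dump ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
```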

## Contact 🌐
Website: [henrynavarro.org](https://henrynavarro.org)

Email: [email protected]