Update README.md with model information and quantization details
README.md
ADDED
@@ -0,0 +1,114 @@
---
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
tags:
- deepseek
- llama.cpp
library_name: transformers
pipeline_tag: text-generation
quantized_by: hdnh2006
---

# DeepSeek-R1-Distill-Qwen-1.5B GGUF llama.cpp quantization by [Henry Navarro](https://henrynavarro.org) 🧠🤖

This repository contains GGUF format model files for DeepSeek-R1-Distill-Qwen-1.5B, quantized using [llama.cpp](https://github.com/ggerganov/llama.cpp).

All models in this repository have been quantized following the [instructions](https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md#quantize) provided by llama.cpp, that is:
```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
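
The block above is the generic llama.cpp recipe. Applied to this particular model, the steps would look roughly like the sketch below (an illustration only, assuming a freshly built llama.cpp checkout and illustrative local paths; it is not the exact command history used to produce the files in this repository):

```bash
# build llama.cpp (CMake build; binaries end up in ./build/bin)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# download the original model into an assumed local folder
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./models/DeepSeek-R1-Distill-Qwen-1.5B

# convert to GGUF F16, then quantize to Q4_K_M (any other type from the table below works the same way)
python3 convert_hf_to_gguf.py ./models/DeepSeek-R1-Distill-Qwen-1.5B --outtype f16 --outfile ./models/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
./build/bin/llama-quantize ./models/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf ./models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
```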

## Model Details

Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

## Models summary 📋
| Filename | Quant type | Description |
| -------- | ---------- | ----------- |
| [DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf) | F16 | Half precision, no quantization applied, largest file |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf) | Q8_0 | 8-bit quantization, highest quality of the quantized files, largest quantized size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf) | Q6_K | 6-bit K-quant, very high quality, close to Q8_0 |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_1.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_1.gguf) | Q5_1 | 5-bit legacy quantization, high quality, slightly larger than Q5_K_M |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf) | Q5_K_M | 5-bit K-quant, good balance of quality and size (recommended) |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf) | Q5_K_S | 5-bit K-quant, slightly smaller than Q5_K_M with minor quality loss |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q5_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_0.gguf) | Q5_0 | 5-bit legacy quantization, high quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf) | Q4_1 | 4-bit legacy quantization, balanced quality and size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf) | Q4_K_M | 4-bit K-quant, balanced quality and size (recommended default) |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf) | Q4_K_S | 4-bit K-quant, slightly smaller than Q4_K_M with minor quality loss |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf) | Q4_0 | 4-bit legacy quantization, balanced quality and size |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf) | Q3_K_L | 3-bit K-quant, largest of the 3-bit variants, lower quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf) | Q3_K_M | 3-bit K-quant, smaller size, lower quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf) | Q3_K_S | 3-bit K-quant, smallest 3-bit variant, lowest 3-bit quality |
| [DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf](https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf) | Q2_K | 2-bit quantization, smallest size, lowest quality |

## Usage with Ollama 🦙

### Direct from Ollama
```bash
ollama run hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B
```
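
If you would rather point Ollama at a locally downloaded GGUF file, a minimal Modelfile can wrap it. This is a sketch only: the file name assumes the Q4_K_M quant sits in the current directory, and the tag `deepseek-r1-1.5b-local` is an arbitrary local name.

```bash
# write a minimal Modelfile that references the local GGUF file (assumed path)
cat > Modelfile <<'EOF'
FROM ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
EOF

# register the model under a local tag, then chat with it
ollama create deepseek-r1-1.5b-local -f Modelfile
ollama run deepseek-r1-1.5b-local
```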

## Download Models Using huggingface-cli 🤗

### Installation of `huggingface_hub[cli]`
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
```bash
huggingface-cli download hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B --include "DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf" --local-dir ./
```
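
Once a file is downloaded, it can also be run directly with llama.cpp. The sketch below assumes the `llama-cli` and `llama-server` binaries are already built and on your PATH, and `-ngl 99` (offload all layers to the GPU) is just an illustrative setting; drop it or lower it on CPU-only machines.

```bash
# interactive chat with the downloaded quant
llama-cli -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -ngl 99 -cnv

# or serve an OpenAI-compatible HTTP endpoint on port 8080
llama-server -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -ngl 99 --port 8080
```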

## Which File Should I Choose? 📈

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system (the sketch after this list shows a quick way to check both). This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
   - **GPU Utilization**: To run the model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version whose file size is 1-2GB smaller than your total VRAM.
3. **Maximizing Quality**:
   - **Combined Memory**: For the highest possible quality, sum your system RAM and your GPU's VRAM, then choose a model file that is 1-2GB smaller than this combined total.
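
As a quick way to get those numbers on a Linux machine, the sketch below uses standard utilities (`free` for system RAM, `nvidia-smi` for NVIDIA VRAM; AMD and Apple users will need their platform's equivalents) and compares them against the size of a downloaded file:

```bash
# total system RAM
free -h | awk '/^Mem:/ {print "System RAM: " $2}'

# total VRAM per NVIDIA GPU (skip on machines without an NVIDIA card)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# size of a downloaded GGUF file, to compare against the numbers above
ls -lh ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
```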

### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
   - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
   - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
   - **I-Quant Models**: Best suited for quantization levels below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better quality for their size. (Note that the table above only lists K-quant and legacy formats; no I-quant files are provided in this repository.)
   - **Compatibility Considerations**:
     - **I-Quant Models**: While usable on CPU and Apple Metal, they run slower than their K-quant counterparts, so the choice becomes a tradeoff between speed and quality at a given size.
     - **AMD Cards**: Check whether you are using the rocBLAS build or the Vulkan build; I-quants are not compatible with Vulkan.
     - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.
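
If you are unsure which quantization a file you already have actually uses, its GGUF metadata records the file type. A sketch, assuming the `gguf` Python package (published alongside llama.cpp) is installed and the Q4_K_M file is in the current directory:

```bash
# install the gguf helper package and dump the file's metadata, which includes the quantization type
pip install gguf
gguf-dump ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
```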

## Contact 🌐
Website: [henrynavarro.org](https://henrynavarro.org)

Email: [email protected]