---
quantized_by: sealad886
license_link: https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
tags:
- chat
- mlx
- conversations
---

# mlx-community/DeepSeek-R1-Distill-Qwen-32B

This model [mlx-community/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co./mlx-community/DeepSeek-R1-Distill-Qwen-32B) contains multiple quantized variants of the base model [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-32B). The model was converted to MLX format using mlx-lm version 0.21.5.

The conversion process applied different quantization strategies to produce variants that offer trade-offs between memory footprint, inference speed, and accuracy. In addition to the default 4-bit conversion, you will find both uniform and mixed quantized files at various bit widths (2-bit, 3-bit, 6-bit, and 8-bit). This multi-quantized approach allows users to select the best variant for their deployment scenario, balancing precision and performance.

## Quantization Configurations

The model conversion uses a range of quantization configurations defined via `mlx_lm.convert`. These configurations fall into three main categories (a reproduction sketch follows the list):

1. **Uniform Quantization:**
   Applies the same bit width to all layers.
   - **3bit:** Uniform 3-bit quantization.
   - **4bit:** Uniform 4-bit quantization (default).
   - **6bit:** Uniform 6-bit quantization.
   - **8bit:** Uniform 8-bit quantization.

2. **Mixed Quantization:**
   Uses a custom predicate function to decide the bit width for each layer—allowing different layers to use different precisions.
   - **2,6_mixed:** Uses the `mixed_2_6` predicate to choose between 2-bit and 6-bit quantization.
   - **3,6_mixed:** Uses the `mixed_3_6` predicate to choose between 3-bit and 6-bit quantization.
   - **3,4_mixed:** Built via `mixed_quant_predicate_builder(3, 4, group_size)`, it mixes 3-bit and 4-bit precision.
   - **4,6_mixed:** Built via `mixed_quant_predicate_builder(4, 6, group_size)`, it mixes 4-bit and 6-bit precision.
   - **4,8_mixed:** Built via `mixed_quant_predicate_builder(4, 8, group_size)`, it mixes 4-bit and 8-bit precision.

   Here `group_size = 64`, which is also the default group size used by the other quantization configurations.

3. **Non-Quantized Conversions:**
   Converts the model to a different floating point precision without quantizing weights.
   - **bfloat16:** Model converted to bfloat16 precision.
   - **float16:** Model converted to float16 precision.
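
For reference, variants like these could be reproduced with the mlx-lm Python API. The sketch below is illustrative rather than the exact conversion script used for this repo: it assumes the `convert` function and the `mixed_quant_predicate_builder` helper shipped with mlx-lm 0.21.x (the predicate builder's import location may differ between versions), and the output paths are placeholders.

```python
# Illustrative sketch, not the exact conversion script used for this repo.
# Assumes mlx-lm ~0.21.x; mixed_quant_predicate_builder has lived in
# mlx_lm.utils, but its location may differ in other versions.
from mlx_lm import convert
from mlx_lm.utils import mixed_quant_predicate_builder

GROUP_SIZE = 64  # default group size, as noted above

# Uniform 6-bit variant
convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    mlx_path="DeepSeek-R1-Distill-Qwen-32B-6bit",  # placeholder output path
    quantize=True,
    q_bits=6,
    q_group_size=GROUP_SIZE,
)

# Mixed 4,6-bit variant: the predicate chooses 4-bit or 6-bit per layer
convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    mlx_path="DeepSeek-R1-Distill-Qwen-32B-4,6_mixed",  # placeholder output path
    quantize=True,
    q_group_size=GROUP_SIZE,
    quant_predicate=mixed_quant_predicate_builder(4, 6, GROUP_SIZE),
)
```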

## Use with mlx

Install the mlx-lm package:
```bash
pip install mlx-lm
```

Load the model and generate text:
```python
from mlx_lm import load, generate

# Load the default (4-bit) variant from the Hub
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-32B")

prompt = "hello"

# Apply the chat template if the tokenizer provides one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
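
For incremental output, mlx-lm also provides streaming generation. This is a minimal sketch assuming mlx-lm 0.21.x, where `stream_generate` yields response chunks exposing a `.text` field (older releases yielded plain strings):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-32B")

messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they are produced instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(chunk.text, end="", flush=True)
print()
```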

Each configuration targets a different trade-off between memory footprint, inference speed, and accuracy, so you can choose the variant that best matches your hardware constraints and performance targets.