---
quantized_by: sealad886
license_link: >-
  https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/blob/main/LICENSE
language:
  - en
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
tags:
  - chat
  - mlx
  - conversations
---

# mlx-community/DeepSeek-R1-Distill-Qwen-14B

This repository, `mlx-community/DeepSeek-R1-Distill-Qwen-14B`, contains multiple quantized variants of the base model `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`. The model was converted to MLX format using mlx-lm version 0.21.5.

The conversion process applied different quantization strategies to produce variants that trade off memory footprint, inference speed, and accuracy. In addition to the default 4-bit conversion, the repository includes both uniform and mixed quantization at several bit widths (2-bit, 3-bit, 6-bit, and 8-bit), so you can pick the variant that best balances precision and performance for your deployment scenario.
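Because all variants live in a single repository, you may not want to download every file. As a minimal sketch (assuming each variant sits in its own subfolder, e.g. a hypothetical `4bit/` directory; check the repository's file listing for the actual layout), you could fetch one variant with `huggingface_hub` and point `mlx_lm.load` at the local path:

```python
# Minimal sketch: download only one quantized variant, then load it locally.
# The "4bit/" folder name is an assumption; check the repo's file listing
# for the actual directory layout before relying on these patterns.
from huggingface_hub import snapshot_download
from mlx_lm import load

local_dir = snapshot_download(
    repo_id="mlx-community/DeepSeek-R1-Distill-Qwen-14B",
    allow_patterns=["4bit/*"],  # assumed subfolder for the 4-bit variant
)

model, tokenizer = load(f"{local_dir}/4bit")
```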

## Quantization Configurations

The model conversion uses a range of quantization configurations defined via `mlx_lm.convert`. These configurations fall into three main categories (a hedged conversion sketch follows the list):

1. Uniform Quantization: applies the same bit width to all layers.
   - `3bit`: uniform 3-bit quantization.
   - `4bit`: uniform 4-bit quantization (the default).
   - `6bit`: uniform 6-bit quantization.
   - `8bit`: uniform 8-bit quantization.
2. Mixed Quantization: uses a custom predicate function to choose the bit width per layer, so different layers can be quantized at different precisions.
   - `2,6_mixed`: uses the `mixed_2_6` predicate to choose between 2-bit and 6-bit quantization.
   - `3,6_mixed`: uses the `mixed_3_6` predicate to choose between 3-bit and 6-bit quantization.
   - `3,4_mixed`: built via `mixed_quant_predicate_builder(3, 4, group_size)`; mixes 3-bit and 4-bit precision.
   - `4,6_mixed`: built via `mixed_quant_predicate_builder(4, 6, group_size)`; mixes 4-bit and 6-bit precision.
   - `4,8_mixed`: built via `mixed_quant_predicate_builder(4, 8, group_size)`; mixes 4-bit and 8-bit precision.

   Here `group_size = 64`, the same default used by the uniform quantization configurations.
3. Non-Quantized Conversions: converts the model to a different floating-point precision without quantizing the weights.
   - `bfloat16`: model converted to bfloat16 precision.
   - `float16`: model converted to float16 precision.
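For reference, here is a minimal sketch of how configurations like these could be produced with `mlx_lm.convert`. It is not the exact script used for this repository, and the import location of `convert` and `mixed_quant_predicate_builder` (assumed here to be `mlx_lm.convert` in mlx-lm 0.21.x) may differ between releases:

```python
# Hedged sketch, not the exact conversion script used for this repository.
# Assumes mlx-lm ~0.21.x, where convert() accepts q_bits, q_group_size,
# dtype, and a quant_predicate; import paths may differ in other releases.
from mlx_lm.convert import convert, mixed_quant_predicate_builder

HF_REPO = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

# Uniform quantization, e.g. the default 4-bit variant with group size 64.
convert(
    HF_REPO,
    mlx_path="DeepSeek-R1-Distill-Qwen-14B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)

# Mixed quantization, e.g. 4,6_mixed: the predicate picks 4-bit or 6-bit per layer.
convert(
    HF_REPO,
    mlx_path="DeepSeek-R1-Distill-Qwen-14B-4,6_mixed",
    quantize=True,
    quant_predicate=mixed_quant_predicate_builder(4, 6, 64),
)

# Non-quantized conversion: cast the weights to float16 without quantizing.
convert(
    HF_REPO,
    mlx_path="DeepSeek-R1-Distill-Qwen-14B-float16",
    dtype="float16",
)
```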

## Use with mlx

Install `mlx-lm`:

```bash
pip install mlx-lm
```

Load the model and generate text:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-14B-MLX")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```

Each configuration targets a different balance of memory footprint, inference speed, and output quality, so you can choose the variant that best matches your deployment's resource constraints and performance targets.