mlx-community/DeepSeek-R1-Distill-Qwen-1.5B

This model, mlx-community/DeepSeek-R1-Distill-Qwen-1.5B, contains multiple quantized variants of the base model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. It was converted to MLX format using mlx-lm version 0.21.5.

The conversion process applied different quantization strategies to produce variants with different trade-offs between memory footprint, inference speed, and accuracy. In addition to the default 4-bit conversion, the repository includes both uniform and mixed-precision quantized files at various bit widths (2-bit, 3-bit, 6-bit, and 8-bit), so users can select the variant that best balances precision and performance for their deployment scenario.

Quantization Configurations

The model conversion uses a range of quantization configurations defined via mlx_lm.convert. These configurations fall into three main categories (a conversion sketch follows the list):

  1. Uniform Quantization: Applies the same bit width to all layers.

    • 3bit: Uniform 3-bit quantization.
    • 4bit: Uniform 4-bit quantization (default).
    • 6bit: Uniform 6-bit quantization.
    • 8bit: Uniform 8-bit quantization.
  2. Mixed Quantization: Uses a custom predicate function to choose a bit width per layer, so different layers can use different precisions.

    • 2,6_mixed: Uses the mixed_2_6 predicate to choose between 2-bit and 6-bit quantization.
    • 3,6_mixed: Uses the mixed_3_6 predicate to choose between 3-bit and 6-bit quantization.
    • 3,4_mixed: Built via mixed_quant_predicate_builder(3, 4, group_size), it mixes 3-bit and 4-bit precision.
    • 4,6_mixed: Built via mixed_quant_predicate_builder(4, 6, group_size), it mixes 4-bit and 6-bit precision.
    • 4,8_mixed: Built via mixed_quant_predicate_builder(4, 8, group_size), it mixes 4-bit and 8-bit precision.

    Here group_size = 64, which is also the default group size used by the other quantization configurations.

  3. Non-Quantized Conversions: Converts the model to a different floating point precision without quantizing weights.

    • bfloat16: Model converted to bfloat16 precision.
    • float16: Model converted to float16 precision.
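
For reference, the sketch below shows how variants like these can be produced with mlx_lm.convert. It is a minimal illustration, not the exact build script: the output paths are placeholders, and the import location of mixed_quant_predicate_builder as well as keyword names such as q_bits, q_group_size, quant_predicate, and dtype follow mlx-lm 0.21.x and may differ in other releases.

from mlx_lm import convert
from mlx_lm.utils import mixed_quant_predicate_builder  # import location may vary by version

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
GROUP_SIZE = 64  # group size shared by all quantized variants

# Uniform quantization (the 3bit / 4bit / 6bit / 8bit variants)
convert(BASE, mlx_path="DeepSeek-R1-Distill-Qwen-1.5B-4bit",
        quantize=True, q_bits=4, q_group_size=GROUP_SIZE)

# Mixed quantization (e.g. the 4,6_mixed variant): the predicate decides,
# layer by layer, whether to quantize to the lower or the higher bit width
predicate = mixed_quant_predicate_builder(4, 6, GROUP_SIZE)
convert(BASE, mlx_path="DeepSeek-R1-Distill-Qwen-1.5B-4,6-mixed",
        quantize=True, q_group_size=GROUP_SIZE, quant_predicate=predicate)

# Non-quantized conversion (the bfloat16 / float16 variants)
convert(BASE, mlx_path="DeepSeek-R1-Distill-Qwen-1.5B-bfloat16", dtype="bfloat16")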

Use with mlx

Install the mlx-lm package:

pip install mlx-lm

Load the model and generate text:

from mlx_lm import load, generate

# Load the model and tokenizer from the Hugging Face Hub (or a local path)
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-MLX")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
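
DeepSeek-R1 distills tend to produce long reasoning traces, so the default token budget can truncate answers. The sketch below raises max_tokens and samples instead of decoding greedily; it assumes an mlx-lm 0.21.x-style API in which generate accepts max_tokens and a sampler built with mlx_lm.sample_utils.make_sampler, and the argument names may differ in other releases.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler  # assumed location in recent mlx-lm releases

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-MLX")

messages = [{"role": "user", "content": "Explain the Pythagorean theorem."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Allow a longer reasoning trace and sample with a mild temperature
sampler = make_sampler(temp=0.6, top_p=0.95)
response = generate(model, tokenizer, prompt=prompt,
                    max_tokens=2048, sampler=sampler, verbose=True)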

Each configuration targets a different point on the memory/accuracy trade-off, letting you match the variant to your deployment's resource constraints and performance targets.
