About

This model is a research project by Novita AI focused on improving large language model inference efficiency while maintaining quality. This release of DeepSeek-R1-Distill-Llama-70B applies weight and KV-cache quantization to increase serving throughput without compromising accuracy.

Model Description

DeepSeek-R1-Distill-Llama-70B is available in two configurations:

  • Standard configuration (bf16)
  • Optimized configuration with weight and KV cache quantization (w8a8kv8)

Key Features

  • Model Architecture: Based on the Llama architecture with 70B parameters
  • Optimized Performance: The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
  • Quantization:
    • 8-bit weight and activation quantization (w8a8)
    • fp8 KV cache (kv8)
  • Context Length: Supports up to 131,072 tokens
  • Precision Options:
    • bf16 for standard version
    • w8a8kv8 for optimized version

Methods

The model employs the following quantization and serving techniques:

  • Weight quantization for model compression (see the illustrative sketch after this list)
  • KV cache optimization using fp8
  • Backend optimization with FLASHINFER for enhanced performance
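
The card does not state which tooling produced the w8a8 weights. As an illustration only, the sketch below shows one common way to build an INT8 weight-and-activation (W8A8) checkpoint with the open-source llm-compressor library; the base model path, calibration dataset, output directory, and sample counts are assumptions, not Novita AI's actual recipe.

# Hypothetical W8A8 quantization sketch using llm-compressor.
# This is NOT the recipe used for this checkpoint; it only illustrates the technique.
from llmcompressor import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),                        # shift activation outliers into weights
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),  # int8 weights and activations
]

oneshot(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # bf16 base checkpoint (assumption)
    dataset="open_platypus",                            # placeholder calibration set
    recipe=recipe,
    output_dir="DeepSeek-R1-Distill-Llama-70B-w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)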

Model Usage

Quick Start

For optimal performance with the w8a8kv8 configuration:

# Environment setup
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Model configuration
model_config = {
    "max_model_len": 131072,
    "max_gen_tokens": 1024,
    "tensor_parallel_size": 2,
    "kv_cache_dtype": "fp8"
}
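
The snippet above lists settings only. Below is a minimal end-to-end sketch using vLLM's offline LLM API (vLLM is implied by the VLLM_ATTENTION_BACKEND variable above), assuming FlashInfer is installed; the prompt and the mapping of max_gen_tokens to SamplingParams.max_tokens are assumptions.

# Minimal offline-inference sketch based on the settings above (assumptions noted inline).
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # FlashInfer backend, used here for the fp8 KV cache

from vllm import LLM, SamplingParams

llm = LLM(
    model="novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888",
    max_model_len=131072,        # full 131,072-token context
    tensor_parallel_size=2,      # two GPUs for the optimized configuration
    kv_cache_dtype="fp8",        # fp8 KV cache (kv8)
)

# "max_gen_tokens" above is treated as the per-request generation limit (assumption).
sampling = SamplingParams(max_tokens=1024)

outputs = llm.generate(["Briefly explain KV cache quantization."], sampling)
print(outputs[0].outputs[0].text)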

Hardware Requirements

  • Standard (bf16): 4 GPUs, tensor parallel size = 4
  • Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2

Model Evaluation

Benchmark Results

  1. Throughput Performance:
  • The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
  2. MMLU Benchmark Scores:
  • bf16: 0.5158 (exact match)
  • w8a8kv8: 0.5169 (exact match)
  3. Subject-specific Performance:
  • Notable improvements in:
    • Biology (+1.11%)
    • Economics (+0.83%)
    • Physics (+0.92%)
  • Slight decreases in:
    • History (-1.57%)
    • Law (-1.46%)
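
The card does not say which evaluation harness produced these MMLU numbers. As an assumption, the sketch below shows how a comparable MMLU run could be set up with lm-evaluation-harness driving a vLLM backend; the task variant and the extra engine arguments are illustrative, not the documented setup.

# Hypothetical MMLU evaluation sketch with lm-evaluation-harness + vLLM (not the documented setup).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888,"
        "tensor_parallel_size=2,kv_cache_dtype=fp8"  # assumes extra args are forwarded to the vLLM engine
    ),
    tasks=["mmlu"],  # task choice is an assumption; the card only reports exact-match scores
)
print(results["results"])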

Limitations and Bias

  • Requires specific backend optimizations for fp8 KV cache
  • Performance may vary depending on hardware configuration
  • Subject-specific performance shows slight variations across different domains

Community

Join our community discussions and get support.
