About
This model is a research project by Novita AI focused on optimizing large language model inference efficiency while preserving model quality. This release of DeepSeek-R1-Distill-Llama-70B applies weight and KV cache quantization to achieve a significant throughput improvement without compromising accuracy.
Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight and KV cache quantization (w8a8kv8)
Key Features
- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with w8a8kv8 configuration
- Quantization Innovation (a conceptual sketch follows this list):
  - Weight and activation quantization to 8-bit integers (the "w8a8" in the configuration name)
  - KV cache quantization to fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version
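The card does not publish the exact quantization recipe, so as a conceptual illustration only, here is a minimal sketch of symmetric per-channel int8 weight quantization, the general family that "w8" naming refers to. The function names and the per-channel granularity are assumptions, not a description of this model's actual method.

```python
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric per-output-channel int8 quantization (illustrative only)."""
    # One scale per output channel, chosen so the largest |value| maps to 127
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original weights
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print("max abs error:", (w - dequantize_int8(q, scale)).abs().max().item())
```

The rounding error is bounded by half a quantization step per channel, which is why int8 weights typically cost little accuracy on large models.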
Methods
The model combines quantization with serving-stack optimizations:
- Weight quantization for model compression
- KV cache quantization to fp8, shrinking the cache's memory footprint (estimated below)
- The FLASHINFER attention backend in vLLM, which provides the fp8 KV cache path
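To make the fp8 KV cache saving concrete, the back-of-the-envelope estimate below computes per-token KV cache size. The architecture constants (80 layers, 8 KV heads, head dimension 128) are the standard Llama-3-70B values and are assumed here rather than taken from this card.

```python
# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # Both keys and values are cached, hence the factor of 2
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

bf16, fp8 = kv_bytes_per_token(2), kv_bytes_per_token(1)
print(f"per token: bf16 {bf16 // 1024} KiB vs fp8 {fp8 // 1024} KiB")
print(f"131,072-token context: bf16 {bf16 * 131072 / 2**30:.0f} GiB "
      f"vs fp8 {fp8 * 131072 / 2**30:.0f} GiB")
```

Halving KV cache bytes both frees GPU memory for more concurrent sequences and reduces attention memory traffic, which is plausibly where much of the 1.6× throughput gain comes from.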
Model Usage
Quick Start
For optimal performance with the w8a8kv8 configuration, enable the FLASHINFER attention backend and pass the fp8 KV cache settings to vLLM:

```bash
# Environment setup: FLASHINFER is required for the fp8 KV cache
export VLLM_ATTENTION_BACKEND=FLASHINFER
```

```python
from vllm import LLM, SamplingParams

# Model configuration: w8a8kv8 across 2 GPUs with an fp8 KV cache
llm = LLM(
    model="novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888",
    max_model_len=131072,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
)

# The original config's max_gen_tokens corresponds to SamplingParams.max_tokens
sampling_params = SamplingParams(max_tokens=1024)
```
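A minimal generation call with the objects above (the prompt is just a placeholder):

```python
outputs = llm.generate(["Explain fp8 KV cache quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```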
Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
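The halved GPU count is driven mostly by weight storage: 70B parameters occupy about 140 GB in bf16 but about 70 GB in int8, so per-GPU weight memory stays roughly constant when the GPU count is halved. The sketch below works through that arithmetic; it ignores activations, KV cache, and runtime overhead, so real requirements are higher.

```python
PARAMS = 70e9  # parameter count

for name, bytes_per_param, gpus in [("bf16", 2, 4), ("w8a8 int8", 1, 2)]:
    total_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: {total_gb:.0f} GB of weights total, "
          f"~{total_gb / gpus:.0f} GB per GPU across {gpus} GPUs")
```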
Model Evaluation
Benchmark Results
- Throughput Performance:
  - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
- MMLU Benchmark Scores (exact match; a reproduction sketch follows this list):
  - bf16: 0.5158
  - w8a8kv8: 0.5169
- Subject-specific Performance:
  - Improvements in:
    - Biology (+1.11%)
    - Economics (+0.83%)
    - Physics (+0.92%)
  - Slight regressions in:
    - History (-1.57%)
    - Law (-1.46%)
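The card does not state which evaluation harness produced the MMLU numbers. As one way to reproduce an exact-match MMLU score on the quantized checkpoint, the sketch below uses lm-evaluation-harness with its vLLM backend; treating this as the original evaluation setup is an assumption.

```python
import lm_eval

# MMLU via lm-evaluation-harness's vLLM backend (assumed setup, not the card's)
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888,"
        "tensor_parallel_size=2,kv_cache_dtype=fp8,max_model_len=131072"
    ),
    tasks=["mmlu"],
)
print(results["results"]["mmlu"])
```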
Limitations and Bias
- The fp8 KV cache requires the FLASHINFER attention backend
- Performance may vary depending on hardware configuration
- Subject-specific performance shows slight variations across different domains
Community
Join our community discussions and get support:
- Discord: Novita AI Discord Community
Model tree for novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888
- Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B