About
This model is a research project by Novita AI focused on optimizing large language model inference efficiency while preserving model quality. This release of DeepSeek-R1-Distill-Llama-70B applies weight and KV cache quantization to achieve a significant throughput improvement without compromising accuracy.
Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight and KV cache quantization (w8a8kv8)
Key Features
- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with w8a8kv8 configuration
- Quantization Innovation (a conceptual sketch follows this list):
  - Weight and activation quantization to 8-bit integers (the "w8a8" in the configuration name)
  - KV cache quantization to fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version
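The card does not publish the exact quantization recipe, so as a conceptual illustration only, here is a minimal sketch of symmetric per-channel int8 weight quantization, the general family that "w8" naming refers to. The function names and the per-channel granularity are assumptions, not a description of this model's actual method.

```python
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric per-output-channel int8 quantization (illustrative only)."""
    # One scale per output channel, chosen so the largest |value| maps to 127
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original weights
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print("max abs error:", (w - dequantize_int8(q, scale)).abs().max().item())
```

The rounding error is bounded by half a quantization step per channel, which is why int8 weights typically cost little accuracy on large models.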
Methods
The model combines quantization with serving-stack optimizations:
- Weight quantization for model compression
- KV cache quantization to fp8, shrinking the cache's memory footprint (estimated below)
- The FLASHINFER attention backend in vLLM, which provides the fp8 KV cache path
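To make the fp8 KV cache saving concrete, the back-of-the-envelope estimate below computes per-token KV cache size. The architecture constants (80 layers, 8 KV heads, head dimension 128) are the standard Llama-3-70B values and are assumed here rather than taken from this card.

```python
# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # Both keys and values are cached, hence the factor of 2
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

bf16, fp8 = kv_bytes_per_token(2), kv_bytes_per_token(1)
print(f"per token: bf16 {bf16 // 1024} KiB vs fp8 {fp8 // 1024} KiB")
print(f"131,072-token context: bf16 {bf16 * 131072 / 2**30:.0f} GiB "
      f"vs fp8 {fp8 * 131072 / 2**30:.0f} GiB")
```

Halving KV cache bytes both frees GPU memory for more concurrent sequences and reduces attention memory traffic, which is plausibly where much of the 1.6× throughput gain comes from.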
Model Usage
Quick Start
For optimal performance with the w8a8kv8 configuration, enable the FLASHINFER attention backend and pass the fp8 KV cache settings to vLLM:

```bash
# Environment setup: FLASHINFER is required for the fp8 KV cache
export VLLM_ATTENTION_BACKEND=FLASHINFER
```

```python
from vllm import LLM, SamplingParams

# Model configuration: w8a8kv8 across 2 GPUs with an fp8 KV cache
llm = LLM(
    model="novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888",
    max_model_len=131072,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
)

# The original config's max_gen_tokens corresponds to SamplingParams.max_tokens
sampling_params = SamplingParams(max_tokens=1024)
```
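A minimal generation call with the objects above (the prompt is just a placeholder):

```python
outputs = llm.generate(["Explain fp8 KV cache quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```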
Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
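The halved GPU count is driven mostly by weight storage: 70B parameters occupy about 140 GB in bf16 but about 70 GB in int8, so per-GPU weight memory stays roughly constant when the GPU count is halved. The sketch below works through that arithmetic; it ignores activations, KV cache, and runtime overhead, so real requirements are higher.

```python
PARAMS = 70e9  # parameter count

for name, bytes_per_param, gpus in [("bf16", 2, 4), ("w8a8 int8", 1, 2)]:
    total_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: {total_gb:.0f} GB of weights total, "
          f"~{total_gb / gpus:.0f} GB per GPU across {gpus} GPUs")
```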
Model Evaluation
Benchmark Results
- Throughput Performance:
  - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
- MMLU Benchmark Scores (exact match; a reproduction sketch follows this list):
  - bf16: 0.5158
  - w8a8kv8: 0.5169
- Subject-specific Performance:
  - Improvements in:
    - Biology (+1.11%)
    - Economics (+0.83%)
    - Physics (+0.92%)
  - Slight regressions in:
    - History (-1.57%)
    - Law (-1.46%)
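The card does not state which evaluation harness produced the MMLU numbers. As one way to reproduce an exact-match MMLU score on the quantized checkpoint, the sketch below uses lm-evaluation-harness with its vLLM backend; treating this as the original evaluation setup is an assumption.

```python
import lm_eval

# MMLU via lm-evaluation-harness's vLLM backend (assumed setup, not the card's)
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888,"
        "tensor_parallel_size=2,kv_cache_dtype=fp8,max_model_len=131072"
    ),
    tasks=["mmlu"],
)
print(results["results"]["mmlu"])
```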
Limitations and Bias
- The fp8 KV cache requires the FLASHINFER attention backend
- Performance may vary depending on hardware configuration
- Subject-specific performance shows slight variations across different domains
Community
Join our community discussions and get support:
- Discord: Novita AI Discord Community
Model tree for novitalabs/DeepSeek-R1-Distill-Llama-70B-w8a8kv8-s888
- Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B