DeepSeek-R1-Distill-Llama-8B-ENK-Aligned

Overview

DeepSeek-R1-Distill-Llama-8B-ENK-Aligned is a safety-aligned version of deepseek-ai/DeepSeek-R1-Distill-Llama-8B. It has been aligned using the Enkrypt AI Safety Alignment dataset, which was generated with the SAGE process:

SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi (2024)
[arXiv:2408.11851]

This alignment significantly reduces toxicity, harmfulness, and jailbreak vulnerability across a range of safety topics while preserving general capability: the MMLU-Pro score moves from 44.71 for the base model to 46.43 for the aligned model.

Red Team Results

Performance Results

Model	MMLU-Pro Score
DeepSeek-R1-Distill-Llama-8B (Base)	44.71
DeepSeek-R1-Distill-Llama-8B-ENK-Aligned	46.43

Training Configuration

The model was trained using the SimPO (Simple Preference Optimization) approach with the following hyperparameters:

cpo_config:
  loss_type: 'simpo'
  max_prompt_length: 1800
  max_length: 3600
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 1
  learning_rate: 1.8e-6
  optim: 'adamw_torch'
  lr_scheduler_type: 'cosine'
  gradient_checkpointing: True
  beta: 5
  num_train_epochs: 1
  bf16: False
  simpo_gamma: 0.8
  warmup_ratio: 0.1
  cpo_alpha: 0.0
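
To make the roles of `beta` and `simpo_gamma` concrete, here is a minimal sketch of the per-pair SimPO objective as described in the SimPO paper: the negative log-sigmoid of the scaled difference between length-normalized log-probabilities, minus a target margin. It is an illustration under stated assumptions, not the training code used for this model. The function names and the toy log-probabilities are hypothetical. The config keys resemble those of TRL's `CPOConfig`, but this card does not say which trainer was used. With `cpo_alpha: 0.0`, the sketch assumes no extra likelihood term is added to the preference loss.

```python
import math

# Values copied from the cpo_config above.
BETA = 5.0          # `beta`: scales the length-normalized log-prob difference
SIMPO_GAMMA = 0.8   # `simpo_gamma`: target reward margin


def avg_logprob(token_logprobs):
    """Length-normalized log-probability of a response (mean over its tokens)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_token_logprobs, rejected_token_logprobs,
               beta=BETA, gamma=SIMPO_GAMMA):
    """SimPO loss for one preference pair.

    Computes -log(sigmoid(beta * (lp_chosen - lp_rejected) - gamma)),
    where lp_* are average per-token log-probabilities under the policy.
    No reference model is involved.
    """
    margin = beta * (avg_logprob(chosen_token_logprobs)
                     - avg_logprob(rejected_token_logprobs)) - gamma
    # -log(sigmoid(x)) == log(1 + exp(-x)); use a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy example (hypothetical numbers): the chosen response is more likely
# per token than the rejected one, so the loss is small.
chosen = [-0.5, -1.5]     # average log-prob: -1.0
rejected = [-1.0, -2.0]   # average log-prob: -1.5
print(simpo_loss(chosen, rejected))
```

Because the log-probabilities are averaged per token, a longer response is not favored merely for its length. The margin `simpo_gamma` requires the chosen response to beat the rejected one by a fixed amount before the loss flattens out.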

Key Improvements

Enhanced Safety: Significant reduction in harmful or toxic outputs.
Improved Robustness: Stronger resistance to adversarial jailbreak prompts.
Minimal Performance Tradeoff: MMLU-Pro improves slightly, from 44.71 to 46.43 (+1.72 points), despite the additional alignment constraints.

Use Cases

This model is well suited to applications that need safe, aligned language generation without giving up general capability, including:

Conversational AI: Ensuring responsible and aligned assistant behavior.
Content Moderation: Filtering harmful content while maintaining contextual understanding.
Education & Research: Deploying AI in sensitive environments with reduced risks.

For questions or contributions, reach out to the Enkrypt AI team!
