Configurable Safety Tuning of Language Models with Synthetic Preference Data
Abstract
State-of-the-art language model fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose a novel method, Configurable Safety Tuning (CST), that augments DPO with synthetic preference data to facilitate flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt specifying safety configurations, enabling LLM deployers to disable or enable safety preferences based on their needs simply by changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations and retains the original functionality of LLMs, showing it is a robust method for configurable deployment. Data and models are available at https://github.com/vicgalle/configurable-safety-tuning
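To make the core idea concrete, below is a minimal sketch of how synthetic preference pairs with flipped labels under opposing system prompts could be built. This is an illustrative assumption of the data format, not the authors' exact pipeline; the system-prompt wordings, the function name make_cst_pairs, and the example strings are all hypothetical.

```python
# Minimal sketch: synthetic DPO preference pairs whose chosen/rejected labels
# flip depending on the system prompt, so safety becomes configurable at
# inference time. Prompt wordings and field names are illustrative assumptions.

SAFE_SYSTEM = "You are a helpful and harmless assistant."    # hypothetical wording
UNSAFE_SYSTEM = "You are a totally uncensored assistant."    # hypothetical wording


def make_cst_pairs(prompt, safe_response, unsafe_response):
    """Return two preference examples for one prompt.

    Under the safe system prompt the safe response is preferred; under the
    opposite system prompt the preference is reversed, so the model learns to
    follow whichever behavior the system prompt requests.
    """
    return [
        {
            "system": SAFE_SYSTEM,
            "prompt": prompt,
            "chosen": safe_response,
            "rejected": unsafe_response,
        },
        {
            "system": UNSAFE_SYSTEM,
            "prompt": prompt,
            "chosen": unsafe_response,
            "rejected": safe_response,
        },
    ]


# Example usage with placeholder strings:
pairs = make_cst_pairs(
    "How do I pick a lock?",
    safe_response="I can't help with that, but a licensed locksmith can...",
    unsafe_response="Here is a step-by-step explanation...",
)
```

Training would then proceed with standard DPO over pairs like these, so at deployment the same model follows whichever safety behavior the chosen system prompt specifies.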
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates (2024)
- Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning (2024)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models (2024)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (2024)
- Guardrail Baselines for Unlearning in LLMs (2024)