---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- rlhf
- alignment
- simulation
- computational social science
---

# Model Card for So(cially)-Good LM

![model image](https://agwarbliu.s3.amazonaws.com/logo.png)

![model image](https://agwarbliu.s3.amazonaws.com/model_select_ours.png)

**An efficient, effective, and stable alternative to RLHF!**

**Instead of training an additional reward model that is likely to be gamed, we train the model directly on the social games!** 🕹️ 🎲 🎮

Full details on the simulation and training can be found [here](https://github.com/agi-templar/Stable-Alignment).
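
For intuition only, here is a minimal, hedged sketch of a stable-alignment-style objective: a standard language-modeling loss on the highest-rated response from the social simulation, plus a margin penalty that pushes lower-rated responses down relative to it. The function name, tensor shapes, and weighting below are illustrative assumptions, not the official implementation; see the repository above for the real training code.

```python
# Conceptual sketch only (assumed shapes and weighting) -- NOT the official
# Stable Alignment implementation. It illustrates the idea of learning from
# directly rated responses instead of a separate reward model.
import torch
import torch.nn.functional as F

def alignment_style_loss(best_logits, best_labels, worse_logits, worse_labels,
                         rating_gap, margin=1.0):
    """best_*: logits (seq, vocab) and labels (seq,) for the highest-rated response;
    worse_*: the same for a lower-rated response;
    rating_gap: how much lower the worse response was rated (scalar tensor)."""
    # Supervised (SFT-style) loss on the socially preferred response.
    sft_loss = F.cross_entropy(best_logits, best_labels)
    # Average log-likelihood the model assigns to each response.
    ll_best = -F.cross_entropy(best_logits, best_labels)
    ll_worse = -F.cross_entropy(worse_logits, worse_labels)
    # Hinge penalty: the worse response should trail the best one by a margin
    # scaled by the rating difference from the simulated social feedback.
    penalty = torch.clamp(margin * rating_gap - (ll_best - ll_worse), min=0.0)
    return sft_loss + penalty
```

The `--rating_scale`, `--margin`, `--ratio`, and `--num_comp` flags in the training script below appear to configure this comparison-based objective (e.g., the rating scale and the number of compared responses); see the repository for their exact semantics.
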
# Training Procedure

Trained with [Stable Alignment](https://github.com/agi-templar/Stable-Alignment) on 8x A100 GPUs for 3 hours. The starting checkpoint is the [SFT model](https://huggingface.co./agi-css/hh-rlhf-sft).

We have also released the [better-base model](https://huggingface.co./agi-css/better-base), which is the starting checkpoint for SFT.

Here is the training script:
```shell
torchrun --nproc_per_node=8 --master_port=36646 train_alignment.py \
    --model_name_or_path /workspace/hhh-sft \
    --data_path /workspace/sandbox_v1.json \
    --bf16 True \
    --output_dir /workspace/output_lm \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 480 \
    --rating_scale 7 \
    --margin 1 \
    --max_flow False \
    --ratio 0.2 \
    --num_comp 3
```
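
With 8 processes, a per-device batch size of 1, and 8 gradient-accumulation steps, the effective global batch size is 8 × 1 × 8 = 64 sequences.

Once trained, the checkpoint can be loaded like any other `transformers` causal LM. A minimal usage sketch is below; the model id is a placeholder (replace it with this repository's id), and the prompt format is an assumption, so follow the template used during SFT.

```python
# Minimal usage sketch with Hugging Face Transformers (not from the original
# card). MODEL_ID is a placeholder -- point it at this repository's id.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "agi-css/socially-good-lm"  # placeholder id, assumption

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

prompt = "How should I respond when a coworker takes credit for my work?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
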
# Bias, Risks, and Limitations

Although this project aims to better align current LMs with social norms, inappropriate content and inherent biases in the training data can still impair the alignment of the model.

The model should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.