|
--- |
|
|
|
|
|
{} |
|
--- |
|
|
|
# Reward model for HH-RLHF |
|
|
|
|
|
|
In this repo, we present a reward model trained with the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model is trained on the [HH-RLHF dataset](https://huggingface.co/datasets/Dahoas/full-hh-rlhf), starting from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
|
|
|
## Model Details |
|
|
|
### Dataset preprocessing |
|
|
|
|
|
|
The HH-RLHF dataset contains 112K comparison samples in the training set and 12.5K comparison samples in the test set. We first replace `\n\nHuman` and `\n\nAssistant` in the dataset with `###Human` and `###Assistant`, respectively.
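
The role-tag replacement amounts to a plain string substitution over every text field. Below is a minimal sketch, assuming the Hugging Face `datasets` library is used to load `Dahoas/full-hh-rlhf`; the helper name is illustrative, not part of the original pipeline.

```python
from datasets import load_dataset  # assumed tooling, not the original LMFlow scripts

def replace_role_tags(example):
    # Swap the original HH-RLHF role markers for the ###-style markers.
    return {
        k: v.replace("\n\nHuman", "###Human").replace("\n\nAssistant", "###Assistant")
        if isinstance(v, str)
        else v
        for k, v in example.items()
    }

raw = load_dataset("Dahoas/full-hh-rlhf")
processed = raw.map(replace_role_tags)
```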
|
|
|
Then, we split the dataset as follows: |
|
|
|
- SFT dataset: the 112K training samples plus the first 6275 samples of the test set; only the chosen responses are used.
- Reward modeling training set: the 112K training samples plus the first 6275 samples of the test set; both the chosen and rejected responses are used.
- Reward modeling test set: the last 6226 samples of the original test set (a code sketch of this split follows the list).
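
The split above can be reproduced with the slicing utilities of the `datasets` library. This is a minimal sketch with illustrative variable names, not the original LMFlow data scripts.

```python
from datasets import load_dataset, concatenate_datasets  # assumed tooling

raw = load_dataset("Dahoas/full-hh-rlhf")
train, test = raw["train"], raw["test"]

# The first 6275 test samples are folded into training; the last 6226 are held out.
extra_train = test.select(range(6275))
rm_test = test.select(range(len(test) - 6226, len(test)))

# Both SFT and reward modeling train on this pool; SFT keeps only the chosen
# responses, while reward modeling uses both chosen and rejected responses.
train_pool = concatenate_datasets([train, extra_train])
```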
|
|
|
### Training |
|
|
|
To use the data more efficiently, we concatenate the texts and split them into chunks of 1024 tokens, rather than padding each batch to its longest text. We then fine-tune the base model on the SFT dataset for two epochs, using a learning rate of 2e-5 with a linear decay schedule.
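
The packing step concatenates all tokenized texts and cuts the result into fixed-size blocks. A minimal sketch of this idea (the function is illustrative, not the exact LMFlow implementation):

```python
def pack_into_blocks(texts, tokenizer, block_size=1024):
    """Concatenate tokenized texts and cut them into fixed-size blocks,
    avoiding per-batch padding to the longest example."""
    token_ids = []
    for text in texts:
        token_ids.extend(tokenizer(text)["input_ids"])
    # Drop the trailing remainder so every block has exactly `block_size` tokens.
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size : (i + 1) * block_size] for i in range(n_blocks)]
```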
|
|
|
We then train the reward model with a learning rate of 5e-6 for one epoch, also with a linear decay schedule, since the model tends to overfit when trained for more than one epoch. We discard samples longer than 512 tokens, which leaves approximately 106K training samples and 5K test samples for reward modeling.
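
The length filter can be written as a `datasets.filter` call. The sketch below assumes the comparison samples expose `chosen` and `rejected` text fields; the exact field names and filtering criterion used in training may differ.

```python
def filter_by_length(dataset, tokenizer, max_tokens=512):
    # Discard comparison samples whose chosen or rejected text exceeds the limit.
    def short_enough(example):
        return all(
            len(tokenizer(example[key])["input_ids"]) <= max_tokens
            for key in ("chosen", "rejected")
        )
    return dataset.filter(short_enough)
```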
|
|
|
We use bf16 precision and do not use LoRA in either stage.
|
|
|
**The resulting model achieves an evaluation loss of 0.5 and an evaluation accuracy of 75.48%.**
|
|
|
|
|
|
|
## Uses |
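
The reward model can be used through the Hugging Face `pipeline` API to score full dialogues formatted with the `###Human` / `###Assistant` template, for example: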
|
|
|
```python |
|
import torch
from transformers import AutoTokenizer, pipeline

# Load the tokenizer that matches the reward model.
rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/hh_rlhf_rm_open_llama_3b")
|
|
|
# Build a text-classification pipeline ("sentiment-analysis" is an alias for
# text-classification) that scores a full dialogue with the reward model.
rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/hh_rlhf_rm_open_llama_3b",
    device_map="auto",
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
|
|
|
# Return the raw score without applying softmax/sigmoid, so the output can be
# used directly as a scalar reward.
pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1,
}
|
|
|
# Example dialogues formatted with the ###Human / ###Assistant template used
# during training.
test_texts = [
|
"###Human: My daughter wants to know how to convert fractions to decimals, but I'm not sure how to explain it. Can you help? ###Assistant: Sure. So one way of converting fractions to decimals is to ask “how many halves are there?” and then write this as a decimal number. But that's a little tricky. Here's a simpler way: if a fraction is expressed as a/b, then it's decimal equivalent is just a/b * 1.0 So, for example, the decimal equivalent of 1/2 is 1/2 * 1.0 = 0.5.", |
|
"###Human: I have fresh whole chicken in my fridge. What dish can I prepare using it that will take me less than an hour to cook? ###Assistant: Are you interested in a quick and easy recipe you can prepare with chicken you have on hand, or something more involved? In terms of both effort and time, what are you looking for?"] |
|
|
|
# Each output is a list with a single score dict; "score" is the scalar reward.
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
|
``` |
|
|
|
|
|
|
|
|
|
|
## Reference |
|
|
|
|
|
|
If you find this model useful, please cite our framework and paper using the following BibTeX entries:
|
|
|
|
|
``` |
|
@article{diao2023lmflow, |
|
title={Lmflow: An extensible toolkit for finetuning and inference of large foundation models}, |
|
author={Diao, Shizhe and Pan, Rui and Dong, Hanze and Shum, Ka Shun and Zhang, Jipeng and Xiong, Wei and Zhang, Tong}, |
|
journal={arXiv preprint arXiv:2306.12420}, |
|
year={2023} |
|
} |
|
``` |
|
``` |
|
@article{dong2023raft, |
|
title={Raft: Reward ranked finetuning for generative foundation model alignment}, |
|
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong}, |
|
journal={arXiv preprint arXiv:2304.06767}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
|
|
|
|
|