---
license: mit
datasets:
  - stanfordnlp/SHP
language:
  - en
metrics:
  - accuracy
tags:
  - human feedback
  - rlhf
  - preferences
  - reddit
  - preference model
  - RL
---

# SteamSHP

SteamSHP is a preference model trained to predict human preferences, given some context and two possible responses. It can be used for NLG evaluation or to train a smaller reward model for RLHF.

It is a FLAN-T5-xl model (3B parameters) finetuned on:

  1. The Stanford Human Preferences Dataset (SHP), which contains aggregate human preferences sourced from 18 different communities on Reddit (e.g., askculinary, legaladvice)
  2. The helpfulness data in Anthropic's HH-RLHF dataset.

## Training and Evaluation

SteamSHP was only finetuned on 125K of the 392K training examples that were available, since we found that:

  1. When the total input length exceeded the limit (512 tokens), the loss would not converge. When possible, we crammed an example into 500 tokens by truncating the context as much as possible, though some examples would still not fit.
  2. Training on fewer preferences with a stronger signal led to better performance than training on all the preferences. From the SHP dataset, we only used preferences where the more preferred comment was at least twice as preferred as the other (i.e., `score_ratio` >= 2) and used no more than 5 preferences from each context (i.e., `post_id`) to prevent overfitting (see the filtering sketch after this list).
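
As a rough illustration of the filtering in point 2, here is a minimal sketch using the Hugging Face `datasets` library. This is not the exact training script; the `score_ratio` and `post_id` column names come from SHP, and the 5-per-post cap is applied with a simple counter.

```python
# A minimal sketch of the SHP filtering described above (not the exact training script).
from collections import Counter
from datasets import load_dataset

shp = load_dataset("stanfordnlp/SHP", split="train")

# Keep only preferences where the preferred comment is at least twice as preferred.
shp = shp.filter(lambda ex: ex["score_ratio"] >= 2)

# Keep no more than 5 preferences per post to prevent overfitting to any one context.
counts = Counter()

def keep_at_most_five(example):
    counts[example["post_id"]] += 1
    return counts[example["post_id"]] <= 5

shp = shp.filter(keep_at_most_five)
```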

We evaluated the model on the SHP and HH-RLHF test data using accuracy, but only on the data that could be truncated to fit within 500 tokens (18,621 examples in total). SteamSHP achieves an average accuracy of 72.8% across all domains:

| Domain | Accuracy |
| ------ | -------- |
| askculinary | 0.7199 |
| askhr | 0.7743 |
| askdocs | 0.7210 |
| askanthropology | 0.7594 |
| asksciencefiction | 0.7283 |
| askacademia | 0.7442 |
| askengineers | 0.7183 |
| legaladvice | 0.8068 |
| explainlikeimfive | 0.7392 |
| askbaking | 0.6741 |
| askphysics | 0.8000 |
| askscience | 0.7114 |
| askphilosophy | 0.6907 |
| askvet | 0.7742 |
| changemyview | 0.7043 |
| askcarguys | 0.7568 |
| askhistorians | 0.7476 |
| asksocialscience | 0.7308 |
| anthropic (helpfulness) | 0.7310 |
| ALL | 0.7278 |

## Usage

Here's how to load the model:


```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-preference-model')
model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-preference-model')
```

The input text should be of the format:

```
POST: { the context, such as the 'history' column in SHP }

RESPONSE A: { first possible continuation }

RESPONSE B: { second possible continuation }

Which response is better? RESPONSE
```

The output generated by SteamSHP will be either A or B.
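
Putting this together, here is a minimal inference sketch. The post and responses below are made-up placeholders, and `tokenizer` and `model` are the objects loaded above.

```python
# A minimal inference sketch; the post and responses are made-up placeholders.
input_text = (
    "POST: How do I keep my cast iron pan from rusting?\n\n"
    "RESPONSE A: Dry it thoroughly after washing and rub a thin layer of oil on it.\n\n"
    "RESPONSE B: Just put it in the dishwasher.\n\n"
    "Which response is better? RESPONSE"
)

x = tokenizer([input_text], return_tensors='pt').input_ids
y = model.generate(x, max_new_tokens=1)
print(tokenizer.batch_decode(y, skip_special_tokens=True))  # -> ['A'] or ['B']
```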

If the input exceeds the 512 token limit, you can use pysbd to break the input up into sentences and only include what fits into 512 tokens.
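
As one possible illustration, here is a truncation sketch assuming the `pysbd` package: it segments the post into sentences and greedily keeps leading sentences until the full prompt would exceed the token limit. The `truncate_post` helper is hypothetical, not part of this repository.

```python
# A minimal truncation sketch using pysbd (pip install pysbd); truncate_post is a hypothetical helper.
import pysbd

def truncate_post(post, response_a, response_b, tokenizer, max_tokens=512):
    seg = pysbd.Segmenter(language="en", clean=False)
    sentences = seg.segment(post)

    def build(p):
        return (f"POST: {p}\n\nRESPONSE A: {response_a}\n\n"
                f"RESPONSE B: {response_b}\n\nWhich response is better? RESPONSE")

    # Greedily keep leading sentences of the post while the full prompt still fits.
    kept = []
    for sentence in sentences:
        candidate = build("".join(kept + [sentence]))
        if len(tokenizer(candidate).input_ids) > max_tokens:
            break
        kept.append(sentence)
    return build("".join(kept))
```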

## Biases and Limitations

Biases in the datasets used to train SteamSHP may be propagated downstream to the model predictions. Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may still contain discriminatory or harmful language. Reddit users on the subreddits covered by SHP are also not representative of the broader population; they are disproportionately from developed, Western, and English-speaking countries.

It is also worth noting that the more preferred response in SHP or HH-RLHF is not necessarily the more correct one -- it just reflects a preference. Past work by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.

## Contact

Please contact [email protected] if you have any questions about the model. This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.

## Citation

We will have a paper out soon, but until then, please cite:

```
@online{SHP,
  author = {Ethayarajh, Kawin and Zhang, Heidi and Wang, Yizhong and Jurafsky, Dan},
  title = {Stanford Human Preferences Dataset},
  year = 2023,
  url = {https://huggingface.co./datasets/stanfordnlp/SHP},
}
```