llama-2-13b-reward-oasst1
This model is a fine-tuned version of meta-llama/Llama-2-13b-chat-hf on the tasksource/oasst1_pairwise_rlhf_reward dataset. It achieves the following results on the evaluation set:
- Loss: 0.4810
- Accuracy: 0.7869
See also vincentmin/llama-2-7b-reward-oasst1 for a 7b version of this model.
Model description
This is a reward model trained with QLoRA in 4-bit precision. The base model is meta-llama/Llama-2-13b-chat-hf, whose license you must accept before you can use it. Once you have been granted access, you can load the reward model as follows:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "vincentmin/llama-2-13b-reward-oasst1"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit with a single-output head; that output is the reward.
model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    load_in_4bit=True,
    torch_dtype=torch.float16,
)
# Attach the trained LoRA adapter weights to the base model.
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_auth_token=True)

model.eval()
with torch.no_grad():
    # The single logit is the scalar reward for the whole conversation.
    reward = model(**tokenizer("prompter: hello world. assistant: foo bar", return_tensors='pt')).logits
reward
For best results, one should use the prompt format used during training:
prompt = "prompter: <prompt_1> assistant: <response_1> prompter: <prompt_2> ..."
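As a minimal sketch of how a conversation could be assembled into that format, the helper below joins (role, text) turns with single spaces. The function name `format_conversation` and the whitespace handling are illustrative assumptions; the exact separators used during training are only known from the template above.

```python
from typing import Iterable, Tuple


def format_conversation(turns: Iterable[Tuple[str, str]]) -> str:
    """Join (role, text) turns into 'prompter: ... assistant: ...' form.

    Hypothetical helper: role names come from the template in this card,
    the single-space separator is an assumption.
    """
    allowed = {"prompter", "assistant"}
    parts = []
    for role, text in turns:
        if role not in allowed:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{role}: {text.strip()}")
    return " ".join(parts)


example = format_conversation([
    ("prompter", "hello world."),
    ("assistant", "foo bar"),
])
print(example)  # prompter: hello world. assistant: foo bar
```

The resulting string can be passed to the tokenizer in place of the hard-coded example in the loading snippet.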
Please use a version of peft in which #755 has been merged, so that the model is loaded correctly. To make sure this is the case, install peft from source:
pip install git+https://github.com/huggingface/peft.git
Intended uses & limitations
Since the model was trained on oasst1 data, the reward will reflect any biases present in the oasst1 data.
Training and evaluation data
The model was trained using QLoRA and the trl library's RewardTrainer on the tasksource/oasst1_pairwise_rlhf_reward dataset. Examples with more than 512 tokens were filtered out of both the training and evaluation data.
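The length filter could look roughly like the sketch below. It is not the author's preprocessing code: the column names `chosen` and `rejected`, the choice to apply the limit to both texts, and the whitespace token counter used for the demonstration are all assumptions.

```python
from typing import Callable, Dict

MAX_TOKENS = 512  # limit stated in this card


def within_limit(example: Dict[str, str],
                 count_tokens: Callable[[str], int],
                 max_tokens: int = MAX_TOKENS) -> bool:
    """Keep a pair only if both texts have at most max_tokens tokens.

    'chosen' and 'rejected' are assumed column names.
    """
    return (count_tokens(example["chosen"]) <= max_tokens
            and count_tokens(example["rejected"]) <= max_tokens)


def whitespace_count(text: str) -> int:
    """Stand-in token counter so the sketch runs without a tokenizer."""
    return len(text.split())


pairs = [
    {"chosen": "short answer", "rejected": "another short answer"},
    {"chosen": "word " * 600, "rejected": "short"},  # 600 tokens, too long
]
kept = [p for p in pairs if within_limit(p, whitespace_count)]
print(len(kept))  # 1
```

With the real tokenizer, the counter would be something like `lambda t: len(tokenizer(t)["input_ids"])`.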
Training procedure
Training hyperparameters
The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float16
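For reference, the values above can be written as keyword arguments for transformers' `BitsAndBytesConfig`. This is a sketch rather than the original training script: the names `QUANT_KWARGS` and `build_quantization_config` are hypothetical, and passing the compute dtype as the string "float16" assumes the installed transformers version accepts it (`torch.float16` is the explicit alternative).

```python
# Values copied from the list above.
QUANT_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    # Assumption: a dtype name as a string is accepted here.
    "bnb_4bit_compute_dtype": "float16",
}


def build_quantization_config():
    """Build the config object; transformers is imported lazily so the
    dictionary above can be inspected without it installed."""
    from transformers import BitsAndBytesConfig
    return BitsAndBytesConfig(**QUANT_KWARGS)
```

The returned object would typically be passed as `quantization_config=` to `from_pretrained` when loading the base model.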
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- max_seq_length: 512
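The arithmetic behind these numbers can be checked with a short sketch. It assumes a single device and no learning-rate warmup, neither of which is stated in this card, and it takes the 3000 steps per epoch from the training results table.

```python
train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1  # assumption: not stated in this card

# Examples consumed per optimizer step.
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)  # 4

# The results table reaches epoch 1.0 at step 3000, so one epoch
# covers roughly this many preference pairs.
steps_per_epoch = 3000
approx_training_pairs = steps_per_epoch * total_train_batch_size
print(approx_training_pairs)  # 12000


def linear_lr(step: int, base_lr: float = 2e-05,
              total_steps: int = steps_per_epoch) -> float:
    """Linear decay from base_lr to 0, assuming zero warmup steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


print(linear_lr(1500))  # learning rate at the midpoint of training
```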
Training results
Training Loss | Epoch | Step | Validation Loss | Accuracy |
---|---|---|---|---|
0.5602 | 0.08 | 250 | 0.5436 | 0.7388 |
0.6166 | 0.17 | 500 | 0.5340 | 0.7468 |
0.6545 | 0.25 | 750 | 0.4899 | 0.7644 |
0.5635 | 0.33 | 1000 | 0.4877 | 0.7532 |
0.5933 | 0.42 | 1250 | 0.4930 | 0.7660 |
0.5758 | 0.5 | 1500 | 0.4851 | 0.7740 |
0.5212 | 0.58 | 1750 | 0.5021 | 0.7788 |
0.5251 | 0.67 | 2000 | 0.4893 | 0.7804 |
0.5145 | 0.75 | 2250 | 0.4924 | 0.7853 |
0.5085 | 0.83 | 2500 | 0.4934 | 0.7853 |
0.617 | 0.92 | 2750 | 0.4803 | 0.7821 |
0.5525 | 1.0 | 3000 | 0.4810 | 0.7869 |
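To help read the loss and accuracy columns, here is a minimal sketch of a pairwise reward objective, assuming the usual formulation in which the loss is -log(sigmoid(r_chosen - r_rejected)) and a pair counts as correct when the chosen response scores higher. The reward values below are made up for illustration and are not outputs of this model.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def evaluate(pairs):
    """Return (mean loss, accuracy) over (r_chosen, r_rejected) pairs."""
    losses = [pairwise_loss(c, r) for c, r in pairs]
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    return sum(losses) / len(losses), accuracy


# Made-up reward pairs: (reward of chosen, reward of rejected).
scores = [(1.2, -0.3), (0.1, 0.4), (2.0, 0.5), (-0.5, -1.0)]
mean_loss, acc = evaluate(scores)
print(acc)  # 0.75
```

Under this formulation a model that assigns equal rewards to every pair has a loss of log 2, about 0.693, which gives a baseline for reading the loss values in the table.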
Framework versions
- PEFT 0.5.0.dev0 (with https://github.com/huggingface/peft/pull/755)
- Transformers 4.32.0.dev0
- Pytorch 2.0.1+cu118
- Datasets 2.14.0
- Tokenizers 0.13.3