Haoxiang-Wang committed
Commit f39a6b6
1 parent: 5131470

Update README.md

Files changed (1)
  1. README.md +116 -3
README.md CHANGED
@@ -1,3 +1,116 @@
---
license: llama3
---

# Arbitrary-Rating Multi-Objective Reward Model (ArmoRM) with Mixture-of-Experts (MoE) Aggregation of Reward Objectives

+ **Authors** (* indicates equal contribution)

[Haoxiang Wang*](https://haoxiang-wang.github.io/), [Wei Xiong*](https://weixiongust.github.io/WeiXiongUST/index.html), [Tengyang Xie](https://tengyangxie.github.io/), [Han Zhao](https://hanzhaoml.github.io/), [Tong Zhang](https://tongzhang-ml.org/)

+ **Blog**: To appear soon (with implementation details)
+ **Tech Report**: To be released in June 2024
+ **Model**: [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1)
  + Finetuned from model: [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1)
+ **Code Repository**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+ **Architecture**

<p align="center">
  <img width="800" alt="image" src="https://github.com/RLHFlow/RLHFlow.github.io/blob/main/assets/ArmoRM-MoE.png?raw=true">
</p>

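To make the architecture figure concrete, below is a minimal, illustrative PyTorch sketch of the scoring head, written to match the quantities exposed by the demo code further down (per-objective rewards, a prompt-conditioned gating output, a fixed reward transform, and a scalar preference score). The class name, the gating MLP depth, and the use of a softmax are assumptions made for illustration, not the actual implementation.

```python
# Illustrative sketch of the ArmoRM + MoE scoring head.
# The structure of the gating network (depth, activation, softmax) is an
# assumption for this sketch; only the aggregation formula mirrors the
# quantities exposed by the released model in the demo code below.
import torch
import torch.nn as nn

class ArmoRMHeadSketch(nn.Module):
    def __init__(self, hidden_size: int, num_objectives: int = 19):
        super().__init__()
        # Multi-objective regression head: one reward per objective
        self.regression_head = nn.Linear(hidden_size, num_objectives)
        # Gating network conditioned on the prompt representation
        self.gating = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_objectives),
            nn.Softmax(dim=-1),
        )
        # Fixed transform applied to the rewards before aggregation
        # (the released model exposes it as `model.reward_transform_matrix`)
        self.reward_transform_matrix = nn.Parameter(
            torch.eye(num_objectives), requires_grad=False
        )

    def forward(self, prompt_emb: torch.Tensor, response_emb: torch.Tensor):
        rewards = self.regression_head(response_emb)              # (B, 19)
        gating_output = self.gating(prompt_emb)                   # (B, 19)
        coeffs = gating_output @ self.reward_transform_matrix.T   # (B, 19)
        score = (rewards * coeffs).sum(dim=-1)                    # (B,)
        return rewards, gating_output, score
```

The demo code below verifies exactly this aggregation on the released model: `output.score` equals the sum of `output.rewards` weighted by `output.gating_output @ model.reward_transform_matrix.T`.
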
## RewardBench Leaderboard

| Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|:------|:-----------|:-------|:-----:|:-----|:----------|:-------|:----------|:------------------------|
| [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) | Llama-3 8B | ArmoRM + MoE | **88.97** | 96.9 | **76.8** | 92.2 | 97.3 | 74.3 |
| Cohere May 2024 | Unknown | Unknown | 88.25 | 96.4 | 71.3 | **92.7** | **97.7** | **78.2** |
| GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.25 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9 |
| [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) | Llama-3 8B | Bradley-Terry | 83.61 | **99.4** | 65.1 | 87.8 | 86.4 | 74.9 |
| [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B) | Yi-34B | Bradley-Terry | 81.44 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4 |

## Demo Code
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(path, device_map=device,
                                                           trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)
# A random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
with torch.no_grad():
    output = model(input_ids)
    # Multi-objective rewards for the response
    multi_obj_rewards = output.rewards.cpu().float()
    # The gating layer's output is conditioned on the prompt
    gating_output = output.gating_output.cpu().float()
    # The preference score for the response, aggregated from the
    # multi-objective rewards with the gating layer
    preference_score = output.score.cpu().float()
# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards with
# the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3)
# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness', 'beavertails-is_safe',
              'prometheus-score', 'argilla-overall_quality', 'argilla-judge_lm', 'code-complexity',
              'code-style', 'code-explanation', 'code-instruction-following', 'code-readability']

example_index = 0
for i in range(K):
    attribute = attributes[top_obj_dims[example_index, i].item()]
    coeff = top_obj_coeffs[example_index, i].item()
    print(f"{attribute}: {round(coeff, 5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3, 3, 4, 2, 2] for the five HelpSteer objectives:
# helpfulness, correctness, coherence, complexity, verbosity
# We can linearly transform our predicted rewards to the
# original reward space to compare with the ground truth
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125   2.859375  3.484375  1.3847656 1.296875 ]
```
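
Building on the demo above, here is a small, illustrative example of wrapping the model into a scoring helper and using the aggregated preference score to rank candidate responses (e.g., for best-of-n sampling or preference-pair labeling). The helper names and batching strategy are ours, not part of the official API; the snippet simply reuses the `model`, `tokenizer`, and `device` objects created in the demo code.

```python
# Illustrative helpers (not part of the official API): score candidate
# responses with the aggregated preference score and pick the best one.
# Assumes `model`, `tokenizer`, and `device` from the demo code above.
import torch
from typing import List

def score_response(prompt: str, response: str) -> float:
    messages = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(input_ids)
    # `output.score` is the scalar preference score shown in the demo above
    return output.score.cpu().float().item()

def best_of_n(prompt: str, candidates: List[str]) -> str:
    # Rank candidate responses by preference score and return the best one
    scores = [score_response(prompt, c) for c in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]

# Example usage with two hypothetical candidate responses
candidates = [
    "Beautiful can be replaced with: gorgeous, stunning, lovely, elegant.",
    "idk",
]
print(best_of_n('What are some synonyms for the word "beautiful"?', candidates))
```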

## Citation

If you find this work useful for your research, please consider citing:
```
@misc{wang2024interpretable,
  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
  author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
  year={2024}
}

@inproceedings{wang2024arithmetic,
  title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
  author={Wang, Haoxiang and Lin, Yong and Xiong, Wei and Yang, Rui and Diao, Shizhe and Qiu, Shuang and Zhao, Han and Zhang, Tong},
  year={2024},
  booktitle={ACL}
}
```
The second entry, "[Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571)", is another recent work of ours that trained a multi-objective reward model and adopted it for LLM alignment, which motivated us to develop the current work.