Mghao committed on
Commit
6a2c761
1 Parent(s): 2ab0c2d

Update README.md

Files changed (2)
  1. INF.jpg +0 -0
  2. README.md +54 -10
INF.jpg ADDED
README.md CHANGED
@@ -5,14 +5,59 @@ datasets:
   - infly/INF-ORM-Preference-Magnitude-80K
  pipeline_tag: text-classification
  ---

  # INF Outcome Reward Model
  ## Introduction

- [**INF-ORM-Llama3.1-70B**](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B-v0.2) is the outcome reward model roughly built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained with the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K).

- **Note: Train Details are coming soon!**

  ## RewardBench Leaderboard

@@ -20,7 +65,7 @@ We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/rew

  | Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
  | :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
- | 1 | **infly/INF-ORM-Llama3.1-70B** | Custom Classifier | 95.2 | 96.9 | 91.0 | 93.8 | 99.1 |
  | 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
  | 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
  | 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
@@ -135,20 +180,19 @@ print(f"Score for response 1: {score1}")
  print(f"Score for response 2: {score2}")

  # Output:
-
  # Score for response 1: 4.96875
  # Score for response 2: 2.890625

  ```

- ## Declaration and License Agreement

- ### Declaration

- ### License Agreement

- ## Contact
- If you have any questions, please feel free to reach us at <[email protected]>.
- ## Citation

 
   - infly/INF-ORM-Preference-Magnitude-80K
  pipeline_tag: text-classification
  ---
+ <div align="center">
+ <img src="INF.jpg" width="300"/>
+
+ 🤗 <a href="https://huggingface.co/infly" target="_blank">Hugging Face</a>
+ <br>
+ <br>
+ <br>
+ </div>

  # INF Outcome Reward Model
  ## Introduction

+ [**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K).
+
+ We did the following three things to improve the performance of our model.
+ ### Data Pre-processing
+ We trained the model on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K), which is derived from the **decontaminated dataset** [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2).
+
+ We used GPT-4o to evaluate how much better the chosen answer is than the rejected answer in [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2) and then added a 'Magnitude' column to the dataset.
+
+ The evaluation follows these rules:
+ 1. If the chosen answer is much better than the rejected answer, set the 'Magnitude' value $d$ to 3.
+ 2. If the chosen answer is better than the rejected answer, set the 'Magnitude' value $d$ to 2.
+ 3. If the chosen answer is slightly better than the rejected answer, set the 'Magnitude' value $d$ to 1.
+
+ After that, we train our model with the scaled BT loss, defined as:
+ $$\mathcal{L}_{\text{Scaled-BT}} = -\alpha \, d \, \log\sigma\big(r_{\theta}(x, y_{c}) - r_{\theta}(x, y_{r})\big)$$
+ where $\alpha$ is the scaling factor, $r_{\theta}$ is the reward model, and $y_{c}$ and $y_{r}$ are the chosen and rejected responses. You can find more details about the scaled BT loss in [1](https://arxiv.org/pdf/2410.01257).
+
+ > Here we look at the performance gains of the scaled BT loss from a different perspective than [1](https://arxiv.org/pdf/2410.01257). The scaled BT loss can be seen as a form of cross-entropy in which the distribution of the logit differences produced by the model is sensitive to the distribution of the magnitudes. We therefore widened the gaps between the 'Magnitude' values, changing them from 1, 2, 3 to 1, 3, 10, and obtained better performance.
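The loss above is straightforward to implement. Below is a minimal PyTorch sketch of the scaled BT loss under the magnitude remapping described above; the function name `scaled_bt_loss` and the toy reward values are illustrative and not part of the released training code.

```python
import torch
import torch.nn.functional as F

# Illustrative remapping of the 'Magnitude' labels described above: 1, 2, 3 -> 1, 3, 10.
MAGNITUDE_MAP = {1: 1.0, 2: 3.0, 3: 10.0}

def scaled_bt_loss(chosen_rewards: torch.Tensor,
                   rejected_rewards: torch.Tensor,
                   magnitude: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Scaled Bradley-Terry loss: -alpha * d * log(sigmoid(r_chosen - r_rejected))."""
    # logsigmoid is numerically stabler than log(sigmoid(...)) for large reward gaps.
    per_pair = -alpha * magnitude * F.logsigmoid(chosen_rewards - rejected_rewards)
    return per_pair.mean()

# Toy usage with stand-in reward values for three preference pairs.
r_chosen = torch.tensor([2.1, 0.3, 1.7])
r_rejected = torch.tensor([1.0, 0.8, -0.2])
d = torch.tensor([MAGNITUDE_MAP[3], MAGNITUDE_MAP[1], MAGNITUDE_MAP[2]])
loss = scaled_bt_loss(r_chosen, r_rejected, d)
```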
+
+ ### Modified Score Head
+ We use a modified score head instead of the original score head.
+ ```python
+ # Modified score head: a two-layer MLP with a ReLU non-linearity.
+ self.score = nn.Sequential(
+     nn.Linear(config.hidden_size, config.hidden_size),
+     nn.ReLU(),
+     nn.Linear(config.hidden_size, 1)
+ )
+ # Original score head: a single linear projection to one scalar.
+ self.score = nn.Linear(config.hidden_size, 1)
+ ```
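For context, here is a hedged sketch of how a score head like this can sit on top of the Llama backbone to produce a scalar reward. The `RewardModel` wrapper and the last-non-padding-token pooling are assumptions for illustration, not the released modeling code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Hypothetical wrapper: transformer backbone + modified score head -> scalar reward."""

    def __init__(self, backbone_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Modified score head from above: two-layer MLP instead of a single Linear.
        self.score = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Score the hidden state of the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[torch.arange(hidden_states.size(0)), last_idx]
        return self.score(last_hidden).squeeze(-1)
```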
+
+ ### Model Merge
+ We trained two models and merged them with a weight of $0.5$ each.
+ | Model | Score | Chat | Chat Hard | Safety | Reasoning |
+ | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
+ | INF-ORM-v1 | 94.3 | 96.1 | 88.2 | 94.6 | 98.2 |
+ | INF-ORM-v2 | 94.4 | 95.5 | 90.8 | 93.0 | 99.1 |
+ | INF-ORM-v3 (Averaged) | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
+
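Merging with a weight of $0.5$ amounts to a plain parameter average of the two checkpoints. A minimal sketch, assuming both models share the same architecture; the file names are placeholders:

```python
import torch

# Hypothetical checkpoint paths; both state dicts must have identical keys and shapes.
state_v1 = torch.load("inf-orm-v1.pt", map_location="cpu")
state_v2 = torch.load("inf-orm-v2.pt", map_location="cpu")

# Average every parameter tensor with weight 0.5 each.
merged = {name: 0.5 * state_v1[name] + 0.5 * state_v2[name] for name in state_v1}

torch.save(merged, "inf-orm-v3-averaged.pt")
```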
60
 
 
61
 
62
  ## RewardBench Leaderboard
63
 
 
65
 
66
  | Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
67
  | :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
68
+ | 1 | **infly/INF-ORM-Llama3.1-70B** | Seq. Classifier | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
69
  | 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
70
  | 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
71
  | 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
 
  print(f"Score for response 2: {score2}")

  # Output:
  # Score for response 1: 4.96875
  # Score for response 2: 2.890625

  ```

+ ## License Agreement
+ INF-ORM-Llama3.1-70B supports commercial applications under a permissive [License](https://huggingface.co/infly/INF-ORM-Llama3.1-70B/blob/main/LICENSE).
+
+ ## Contact
+ If you have any questions, please feel free to reach us at <[email protected]>, <[email protected]>, and <[email protected]>.
+
+ ## Acknowledgement
+ This work was done during my internship at INF. I would like to thank my mentors (quchao, tanxiaoyu) and the INF team for their support. Their insights and expertise greatly contributed to the successful completion of this work.
 
 
 
 