Mghao committed
Commit 6a2c761 • 1 Parent(s): 2ab0c2d
Update README.md

INF.jpg ADDED
README.md CHANGED

datasets:
- infly/INF-ORM-Preference-Magnitude-80K
pipeline_tag: text-classification
---

<div align="center">
<img src="INF.jpg" width="300"/>

🤗 <a href="https://huggingface.co/infly" target="_blank">Hugging Face</a>
<br>
<br>
<br>
</div>

# INF Outcome Reward Model
## Introduction

[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K).

We did the following three things to improve the performance of our model.
### Data Pre-processing
We trained it on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K), which is derived from the **decontaminated dataset** [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2).

We use GPT-4o to evaluate the difference between the chosen answer and the rejected answer in [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2) and then add a 'Magnitude' column to the dataset (an illustrative sketch of this annotation step follows the rules below).

The evaluation follows these rules:
1. If the chosen answer is much better than the rejected answer, set the 'Magnitude' value $d$ to 3.
2. If the chosen answer is better than the rejected answer, set the 'Magnitude' value $d$ to 2.
3. If the chosen answer is slightly better than the rejected answer, set the 'Magnitude' value $d$ to 1.
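
The exact annotation prompt is not published in this card, so the snippet below is only an illustrative sketch of how such a 'Magnitude' label could be collected with GPT-4o via the OpenAI Python client; the prompt wording and the `annotate_magnitude` helper are assumptions, not the pipeline actually used for INF-ORM-Preference-Magnitude-80K.

```python
# Illustrative only: the real annotation prompt and pipeline are not published here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def annotate_magnitude(prompt: str, chosen: str, rejected: str) -> int:
    """Ask GPT-4o how much better the chosen answer is: 3, 2, or 1."""
    instruction = (
        "Compare the two answers to the prompt below and reply with a single number: "
        "3 if the chosen answer is much better than the rejected answer, "
        "2 if it is better, and 1 if it is only slightly better.\n\n"
        f"Prompt: {prompt}\n\nChosen answer: {chosen}\n\nRejected answer: {rejected}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": instruction}],
    )
    return int(response.choices[0].message.content.strip())
```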

After that, we train our model with the scaled BT loss. The scaled BT loss is defined as:

$$\mathcal{L}_{\text{Scaled-BT}} = -\alpha \cdot d \cdot \log\left(\sigma\left(r_{\theta}(x, y_{c}) - r_{\theta}(x, y_{r})\right)\right)$$

where $\alpha$ is the scaling factor. You can find more details about the scaled BT loss in [1](https://arxiv.org/pdf/2410.01257).
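
As a concrete reference, here is a minimal PyTorch sketch of this loss; the function and tensor names are assumptions for illustration, with `d` carrying the per-pair 'Magnitude' value.

```python
import torch
import torch.nn.functional as F

def scaled_bt_loss(chosen_rewards: torch.Tensor,
                   rejected_rewards: torch.Tensor,
                   d: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """-alpha * d * log(sigmoid(r(x, y_c) - r(x, y_r))), averaged over the batch."""
    margin = chosen_rewards - rejected_rewards
    return -(alpha * d * F.logsigmoid(margin)).mean()
```

Pairs with a larger annotated gap therefore contribute proportionally more to the gradient, which is what the magnitude rescaling discussed below exploits.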

> Here we look at the performance gains of the scaled BT loss from a different perspective than [1](https://arxiv.org/pdf/2410.01257). The scaled BT loss can be thought of as a form of cross-entropy, where the distribution of the logit differences produced by the model is sensitive to the distribution of the magnitudes. Therefore, we widen the gaps between the 'Magnitude' values from 1, 2, 3 to 1, 3, 10 and obtain better performance.

### Modified Score Head
We use a modified score head instead of the original score head.
```python
# modified score head: two-layer MLP
self.score = nn.Sequential(
    nn.Linear(config.hidden_size, config.hidden_size),
    nn.ReLU(),
    nn.Linear(config.hidden_size, 1)
)

# original score head: a single linear layer
self.score = nn.Linear(config.hidden_size, 1)
```
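
For context only, here is a minimal sketch of one common way such a head is used: take the hidden state of the last non-padding token from the decoder backbone and map it to a scalar reward. The pooling choice and the `RewardHeadSketch` name are assumptions for illustration, not the published INF-ORM implementation.

```python
import torch
import torch.nn as nn

class RewardHeadSketch(nn.Module):
    """Illustrative wrapper: final hidden states -> one scalar reward per sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Same two-layer shape as the modified score head above.
        self.score = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        last_token = attention_mask.sum(dim=1) - 1
        pooled = hidden_states[torch.arange(hidden_states.size(0)), last_token]
        return self.score(pooled).squeeze(-1)  # shape: (batch,)
```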

### Model Merge
We trained two models and merged them with weight $0.5$; a sketch of the merge step follows the table.

| Model | Score | Chat | Chat Hard | Safety | Reasoning |
| -------------------- | :---: | :---: | :-------: | :----: | :-------: |
| INF-ORM-v1 | 94.3 | 96.1 | 88.2 | 94.6 | 98.2 |
| INF-ORM-v2 | 94.4 | 95.5 | 90.8 | 93.0 | 99.1 |
| INF-ORM-v3 (Averaged) | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
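
Merging with weight $0.5$ amounts to a simple parameter-space average of the two checkpoints. Below is a minimal sketch, assuming both models share an identical architecture; the checkpoint paths are placeholders, not released files.

```python
import torch

def average_checkpoints(path_a: str, path_b: str, weight: float = 0.5) -> dict:
    """Weighted average of two state dicts with identical keys and shapes."""
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    return {k: weight * sd_a[k] + (1.0 - weight) * sd_b[k] for k in sd_a}

# Placeholder file names, for illustration only.
merged = average_checkpoints("inf_orm_v1.bin", "inf_orm_v2.bin", weight=0.5)
torch.save(merged, "inf_orm_v3_averaged.bin")
```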

## RewardBench Leaderboard

| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
| :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
| 1 | **infly/INF-ORM-Llama3.1-70B** | Seq. Classifier | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
| 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |

```python
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 4.96875
# Score for response 2: 2.890625
```

## License Agreement
INF-ORM-Llama3.1-70B supports commercial applications under a permissive [License](https://huggingface.co/infly/INF-ORM-Llama3.1-70B/blob/main/LICENSE).

## Contact
If you have any questions, please feel free to reach us at <[email protected]>, <[email protected]> and <[email protected]>.

## Acknowledgement
This work was done during my internship at INF. I would like to thank my mentors (quchao, tanxiaoyu) and the INF team for their support. Their insights and expertise greatly contributed to the successful completion of this work.