Update README.md
README.md CHANGED
@@ -50,42 +50,25 @@ The final **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-
<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">
- 2. Large-scale synthetic data generator
- 3. Two-phased-training with replay buffers
- This makes the teacher model better exploit the task distributions defined by the local examples of each node, while the diversity of the taxonomy itself ensures the entire generation covers a wide range of tasks, as illustrated below. In turn, this allows using Mixtral 8x7B as the teacher model for generation while performing very competitively with models such as ORCA-2, WizardLM, and Zephyr Beta that rely on synthetic data generated by much larger and more capable models like GPT-4.
- For adding new domain-specific knowledge, we provide an external knowledge source (document) and prompt the model to generate questions and answers based on the document.
- Foundational skills such as reasoning and compositional skills such as creative writing are generated through in-context learning using the seed examples from the taxonomy.
- Additionally, to ensure the data is high-quality and safe, we check the generated questions and answers for groundedness and safety. This is done using the same teacher model that generated the data.
- Our training consists of two major phases: knowledge tuning and skills tuning.
- Knowledge tuning has two steps: the first learns simple knowledge (short samples) and the second learns complicated knowledge (longer samples).
- The second step uses a replay buffer with data from the first step.
- Both foundational and compositional skills are learned during the skills-tuning phase, where a replay buffer of data from the knowledge phase is used.
- Importantly, we use a set of hyper-parameters for training that is very different from standard small-scale supervised fine-tuning: a larger batch size and a carefully optimized learning rate and scheduler.
## Model description
+ Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1].
+ [^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.
+ We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO policy and the base policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
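The log-ratio proxy reward described here can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name and the token-level log-probabilities (which would come from the instruct and base Mixtral models scoring the same completion) are assumptions for demonstration.

```python
def dpo_reward(policy_logprobs, ref_logprobs, beta=1.0):
    """DPO-style proxy reward: beta * log(pi(y|x) / pi_ref(y|x)),
    computed as the difference of summed per-token log-probabilities
    of the same completion under the two models."""
    assert len(policy_logprobs) == len(ref_logprobs)
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))

# Hypothetical per-token log-probs for one completion:
reward = dpo_reward([-1.0, -2.0], [-1.5, -2.5])  # 1.0
```

A completion that the tuned model finds more likely than the reference model gets a positive reward, which is what makes the log-ratio usable as a preference signal without training a separate reward model.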
+ With the Mixtral log-ratio as the reward model, we then chose iterative rejection-sampling fine-tuning as the RL alignment method. For each prompt, we sample \( N \) completions from the current policy (starting from the SFT model), query the preference reward, and select the highest-scoring sample as the target. The policy is then updated through supervised fine-tuning on the rejection-sampled outputs, and this process is iterated by conducting additional rounds of best-of-N sampling followed by SFT training.
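The best-of-N selection step can be sketched as below. The function names, and the `sample_fn`/`reward_fn` callables standing in for the policy model and the Mixtral log-ratio reward, are illustrative assumptions:

```python
def best_of_n(prompt, sample_fn, reward_fn, n):
    """Sample n completions from the current policy and keep the one
    with the highest preference reward."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))

def rejection_sampling_round(prompts, sample_fn, reward_fn, n=4):
    """One round: build (prompt, best completion) pairs for the next
    SFT update; repeating rounds gives the iterative procedure."""
    return [(p, best_of_n(p, sample_fn, reward_fn, n)) for p in prompts]
```

Each round's output is an ordinary SFT dataset, which is what lets the method reuse standard supervised fine-tuning machinery instead of an RL optimizer.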
+ The prompt space for preference tuning was uniformly sampled by source from the LAB SFT data distribution, which has extensive coverage of knowledge, domains, and tasks.
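Uniform-by-source sampling can be sketched as follows; the function name and the toy source pools are hypothetical, standing in for the LAB SFT data's per-source prompt sets:

```python
import random

def sample_prompts_by_source(prompts_by_source, k, seed=0):
    """Draw k prompts by first picking a source uniformly, then a
    prompt uniformly within that source, so every source is equally
    represented regardless of how many prompts it contributes."""
    rng = random.Random(seed)
    sources = sorted(prompts_by_source)
    return [rng.choice(prompts_by_source[rng.choice(sources)])
            for _ in range(k)]
```

Sampling the source first (rather than pooling all prompts) keeps small sources from being drowned out by large ones, preserving the distribution's coverage of knowledge, domains, and tasks.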
+ ### Discussion
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/Vt0eldYNUW1vOpLBd-_DI.png" width="650">
+ The preference-tuned version of Merlinite-7B shows performance enhancement across the board, with no alignment tax observed in our evaluation. Surprisingly, we find improvements in mathematical ability as measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning ability after RLHF alignment.
+ We also observe a clear correlation between Mixtral DPO reward scores and MT-Bench scores. The reward score of the best-of-N sampled batch improved until rejection-sampling round 2; after that, the model saturates and no longer improves on either MT-Bench or the Mixtral DPO reward.
+ We observed RL saturation in the second RL iteration, with a fixed reward. The final Merlinite-7B-pt is the peak checkpoint as measured by both batch reward and MT-Bench.
## Model description