This is openchat/openchat-3.5-0106, tuned with DPO on a subset Nectar. This time with 5000 steps, a full epoch.
Careful attention was paid to make sure the chat template was followed properly.
Data selection and filtering:
- filtered dataset to only include examples with multiple turns, to preserve strength in multi-turn scenarios
- used the 4th ranking response as the "rejected" instead of the 3rd. When I inspected the dataset, I frequently could not find any meaningful difference in quality between the 1st and 3rd ranked responses, so to make the accepted/rejected signal extra clear, I replaced 3rd ranking with 4th ranking.
- I filtered out any examples with "good_natured == False". Why? When I inspected examples with "good_natured == False" in the Nectar dataset, I noticed they frequently include refusals from even the top ranking model. So, counter-intuitively, including "bad natured" entries might actually censor the model more, since the top responses (as ranked by GPT-4) to these queries tend to be refusals. Not to mention, the quality of the conversations that are "bad natured" tends to be worse in general, in my own opinion.
Differences from 0.4:
- Trained on 5000 steps instead of 500, with a lower learning rate and slower warmup period.
Summary of versions:
- 200 steps, no filtering on Nectar dataset, 5e-5 learning rate
- empty repo, failed training. ignore it
- 500 steps, no filtering on Nectar dataset, 5e-5 learning rate (same as 1 but with more steps)
- 500 steps, filtered dataset to only include multi-chat-turn examples, used 4th ranking response as the "rejected" instead of 3rd, filtered out "good_natured=False", 5e-5 learning rate
- 5000 steps (over a full epoch), filtered dataset to only include multi-chat-turn examples, used 4th ranking response as the "rejected" instead of 3rd, filtered out "good_natured=False", 5e-6 learning rate. Same as 0.4 but with 10x the steps, and 1/10th the learning rate
- 500 steps, filtered dataset to only include multi-chat-turn examples, used 4th ranking response as the "rejected" instead of 3rd, filtered out "good_natured=False", 5e-5 learning rate. Same as 0.5 but with 1/10th the steps, and 10x the learning rate
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 69.67 |
AI2 Reasoning Challenge (25-Shot) | 66.72 |
HellaSwag (10-Shot) | 83.53 |
MMLU (5-Shot) | 65.36 |
TruthfulQA (0-shot) | 52.15 |
Winogrande (5-shot) | 82.08 |
GSM8k (5-shot) | 68.16 |
- Downloads last month
- 17
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Model tree for andysalerno/openchat-nectar-0.5
Dataset used to train andysalerno/openchat-nectar-0.5
Evaluation results
- normalized accuracy on AI2 Reasoning Challenge (25-Shot)test set Open LLM Leaderboard66.720
- normalized accuracy on HellaSwag (10-Shot)validation set Open LLM Leaderboard83.530
- accuracy on MMLU (5-Shot)test set Open LLM Leaderboard65.360
- mc2 on TruthfulQA (0-shot)validation set Open LLM Leaderboard52.150
- accuracy on Winogrande (5-shot)validation set Open LLM Leaderboard82.080
- accuracy on GSM8k (5-shot)test set Open LLM Leaderboard68.160