davidberenstein1957 posted an update Mar 22
🔥🆕🆕🔥 Dataset Drop: 4 KTO-formatted versions of the much-loved Argilla DPO datasets.

KTO formats for:
- UltraFeedback Cleaned Binarized
- Distilabel Intel Orca
- Distilabel Capybara
- DPO mix

argilla/preference-datasets-for-kto-65f98314d7c1b04ab54d41a7
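If you want to apply the same transformation to your own preference data, the split is mechanical: each DPO pair becomes one desirable and one undesirable KTO example. Here is a minimal sketch with the 🤗 datasets library, assuming the source dataset has "prompt", "chosen", and "rejected" columns (the dataset ID is a placeholder):

```python
from datasets import load_dataset

# Hypothetical source: any DPO-style dataset with prompt/chosen/rejected columns.
dpo_ds = load_dataset("your-org/your-dpo-dataset", split="train")

def dpo_to_kto(rows):
    """Split each DPO pair into two KTO examples: one desirable, one undesirable."""
    prompts, completions, labels = [], [], []
    for prompt, chosen, rejected in zip(rows["prompt"], rows["chosen"], rows["rejected"]):
        prompts += [prompt, prompt]
        completions += [chosen, rejected]
        labels += [True, False]  # True = desirable completion, False = undesirable
    return {"prompt": prompts, "completion": completions, "label": labels}

# n DPO pairs -> 2n KTO examples
kto_ds = dpo_ds.map(dpo_to_kto, batched=True, remove_columns=dpo_ds.column_names)
```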

Paper claims :)

https://arxiv.org/abs/2402.01306

KTO matches or exceeds DPO performance at scales from 1B to 30B parameters. That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.

KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.

When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.
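To train on data in this format, TRL provides a KTOTrainer; the desirable/undesirable weights in KTOConfig are the knobs for the imbalance scenario described above. A rough sketch with placeholder model and dataset IDs (exact argument names can vary across TRL versions, e.g. tokenizer vs. processing_class):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "your-org/your-base-or-sft-model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KTO-formatted dataset with "prompt", "completion", and boolean "label" columns.
train_dataset = load_dataset("your-org/your-kto-dataset", split="train")  # placeholder

training_args = KTOConfig(
    output_dir="kto-model",
    per_device_train_batch_size=4,
    # Re-weight the two losses when desirable/undesirable examples are imbalanced.
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

trainer = KTOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```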

Do you need something custom? Take a look at @davanstrien's guide on creating your own KTO dataset with Argilla and our community.

https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
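As a rough idea of what the labeling setup in that guide looks like, here is a hedged sketch using the Argilla 1.x FeedbackDataset API (field names, question, dataset name, and workspace are placeholders; the 2.x API differs):

```python
import argilla as rg

# Connect to your Argilla instance (URL and API key are placeholders).
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# Raters see a prompt + completion and mark the completion as good or bad,
# which maps directly onto KTO's desirable/undesirable labels.
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="completion"),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="Is this completion a good response to the prompt?",
            labels=["good", "bad"],
        )
    ],
)

# Add records to annotate and push the dataset to the Argilla server.
records = [
    rg.FeedbackRecord(fields={"prompt": "What is KTO?", "completion": "KTO is ..."}),
]
dataset.add_records(records)
dataset.push_to_argilla(name="kto-preference", workspace="admin")
```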