Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state-of-the-art open post-training recipes. • 32 items • Updated 27 days ago • 62
ProX Refining Models Collection Adapted small language models used to generate data refining programs • 5 items • Updated Oct 10 • 2
Magpie-Qwen2 Datasets Collection Datasets built with Qwen2 72B and Qwen2 7B. • 6 items • Updated Sep 14 • 10
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 87
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published Jun 17 • 57
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20 • 86
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Article • Mar 20 • 69
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct Paper • 2405.14906 • Published May 23 • 23