
Leon Lee

Leon-Leee

yucc-leon

AI & ML interests

LLMs, code generation, chatbot, workflows

Recent Activity

liked a dataset 2 days ago

m-a-p/FineFineWeb

liked a Space 12 days ago

HuggingFaceFW/discussion

liked a dataset 20 days ago

CASIA-LM/ChineseWebText2.0



Leon-Leee's activity

upvoted a collection about 1 month ago

Tulu 3 Datasets

Collection

All datasets released with Tulu 3 -- state-of-the-art open post-training recipes. • 32 items • Updated 27 days ago • 62

upvoted a collection 3 months ago

ProX Refining Models

Collection

Adapted small language models used to generate data refining programs • 5 items • Updated Oct 10 • 2

upvoted a collection 5 months ago

Magpie-Qwen2 Datasets

Collection

Datasets built with Qwen2 72B and Qwen2 7B. • 6 items • Updated Sep 14 • 10

upvoted 3 papers 6 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25 • 87

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Paper • 2406.11931 • Published Jun 17 • 57

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Paper • 2406.14491 • Published Jun 20 • 86

upvoted 2 articles 7 months ago

Article

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20 • 69

Article

LLM Data Engineering 3: Data Collection Magic, Methods for Obtaining Top-Tier Training Data

Jun 4 • 14

upvoted 2 papers 7 months ago

StarCoder 2 and The Stack v2: The Next Generation

Paper • 2402.19173 • Published Feb 29 • 136

AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

Paper • 2405.14906 • Published May 23 • 23