Synthetic Data papers - a kd303 Collection

kd303 's Collections

Reasoning-lastest

code

Models

RAG

Synthetic Data papers

Agents

Synthetic Data papers

updated Dec 10, 2024

Papers and important approraches for generation of synthetic data

AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3, 2024 • 51
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published Jun 12, 2024 • 66
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22, 2024 • 254
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Paper • 2402.10379 • Published Feb 16, 2024 • 30
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Paper • 2401.16380 • Published Jan 29, 2024 • 48
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11, 2024 • 29
WizardLM: Empowering Large Language Models to Follow Complex Instructions

Paper • 2304.12244 • Published Apr 24, 2023 • 14
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20, 2024 • 47
Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 143
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28, 2024 • 98
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Paper • 2404.18796 • Published Apr 29, 2024 • 68
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Paper • 2405.20541 • Published May 30, 2024 • 22
Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22, 2024 • 2
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Paper • 2409.08239 • Published Sep 12, 2024 • 17
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Paper • 2305.07759 • Published May 12, 2023 • 33
Improving Text Embeddings with Large Language Models

Paper • 2401.00368 • Published Dec 31, 2023 • 79
Magicoder: Source Code Is All You Need

Paper • 2312.02120 • Published Dec 4, 2023 • 80
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Paper • 2403.13638 • Published Mar 20, 2024
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Paper • 2401.00690 • Published Jan 1, 2024 • 1
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Paper • 2402.10176 • Published Feb 15, 2024 • 36
SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

Paper • 2401.07950 • Published Jan 15, 2024 • 4
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Paper • 2312.06585 • Published Dec 11, 2023 • 28
Simple synthetic data reduces sycophancy in large language models

Paper • 2308.03958 • Published Aug 7, 2023 • 21
CodecLM: Aligning Language Models with Tailored Synthetic Data

Paper • 2404.05875 • Published Apr 8, 2024 • 17