mapama247's Collections
Synthetic Data Generation
updated
Textbooks Are All You Need
Paper
•
2306.11644
•
Published
•
142
Textbooks Are All You Need II: phi-1.5 technical report
Paper
•
2309.05463
•
Published
•
87
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
•
2406.20094
•
Published
•
94
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Paper
•
2406.14491
•
Published
•
85
Improving Text Embeddings with Large Language Models
Paper
•
2401.00368
•
Published
•
79
Adapting Large Language Models via Reading Comprehension
Paper
•
2309.09530
•
Published
•
75
Magicoder: Source Code Is All You Need
Paper
•
2312.02120
•
Published
•
79
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Paper
•
2401.01335
•
Published
•
64
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper
•
2406.08464
•
Published
•
62
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
Paper
•
2312.14187
•
Published
•
49
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper
•
2402.13064
•
Published
•
46
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Paper
•
2401.16380
•
Published
•
47
AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper
•
2407.03502
•
Published
•
43
Self-Alignment with Instruction Backtranslation
Paper
•
2308.06259
•
Published
•
40
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
Paper
•
2410.09584
•
Published
•
42
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
•
2402.10176
•
Published
•
34
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Paper
•
2305.07759
•
Published
•
31
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper
•
2402.10379
•
Published
•
29
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
•
2404.07503
•
Published
•
29
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Paper
•
2312.06585
•
Published
•
28
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
Paper
•
2307.03692
•
Published
•
24
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper
•
2307.08701
•
Published
•
22
Simple synthetic data reduces sycophancy in large language models
Paper
•
2308.03958
•
Published
•
21
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper
•
2404.05875
•
Published
•
16
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Paper
•
2409.08239
•
Published
•
15
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Paper
•
2304.12244
•
Published
•
13
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
Paper
•
2402.18334
•
Published
•
12
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Paper
•
2408.03256
•
Published
•
10
Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper
•
2212.10560
•
Published
•
7
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Paper
•
2305.14233
•
Published
•
6
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Paper
•
2406.16783
•
Published
•
4
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
•
2404.14361
•
Published
•
1
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
Paper
•
2305.16635
•
Published
•
1
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
Paper
•
2407.10627
•
Published
•
1
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
Paper
•
2202.07922
•
Published
•
1
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
Paper
•
2403.04190
•
Published
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Paper
•
2406.15126
•
Published
Large Language Models for Data Annotation: A Survey
Paper
•
2402.13446
•
Published
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Paper
•
2306.15895
•
Published
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
Paper
•
2404.14445
•
Published
TarGEN: Targeted Data Generation with Large Language Models
Paper
•
2310.17876
•
Published