Magpie: Generating High-Quality Synthetic Data with Open-Source Large Language Models

dc.contributor.advisor: Poovendran, Radha
dc.contributor.author: Xu, Zhangchen
dc.date.accessioned: 2025-08-01T22:21:44Z
dc.date.available: 2025-08-01T22:21:44Z
dc.date.issued: 2025-08-01
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present Magpie, a self-synthesis method for generating large-scale alignment data. Our key observation is that aligned LLMs such as Llama-3-Instruct can generate a user query when we input only the pre-query template up to the position reserved for user messages, thanks to their auto-regressive nature (a minimal sketch of this prompting setup is given below the record). We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We further introduce extensions of Magpie for filtering and for generating multi-turn, preference-optimization, domain-specific, and multilingual datasets. We perform a comprehensive analysis of the Magpie-generated data. To compare Magpie-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using Magpie solely for supervised fine-tuning (SFT) can surpass models trained on previous public datasets used for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. We also show that on some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through SFT and subsequent preference optimization. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Xu_washington_0250O_28543.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53563
dc.language.iso: en_US
dc.rights: CC BY-NC
dc.subject: Alignment
dc.subject: Large Language Models
dc.subject: Post-Training
dc.subject: Synthetic Data Generation
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Electrical and computer engineering
dc.title: Magpie: Generating High-Quality Synthetic Data with Open-Source Large Language Models
dc.type: Thesis
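
The abstract above describes how Magpie elicits instructions by feeding an aligned model only the pre-query chat template and letting it auto-complete. The following is a minimal sketch of that idea using the Hugging Face transformers API and Llama-3's chat format; the model ID, sampling parameters, and two-pass structure are illustrative assumptions, not the thesis's exact pipeline, filtering, or scale.

# Minimal, illustrative sketch of Magpie-style self-synthesis prompting.
# Assumptions (not from the thesis): Hugging Face transformers, this model ID,
# and these sampling settings; the filtering and multi-turn extensions are
# not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template: the chat prefix up to the slot reserved for the user
# message. Because the model is auto-regressive, sampling a continuation from
# this prefix yields a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0, top_p=1.0)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

# Second pass: wrap the synthesized instruction in a normal chat turn to obtain
# the paired response.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
reply = model.generate(prompt_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
response = tokenizer.decode(reply[0][prompt_ids.shape[1]:], skip_special_tokens=True).strip()

print({"instruction": instruction, "response": response})

Repeating the two passes many times (with sampling enabled) yields a large pool of (instruction, response) pairs; the thesis then applies filtering and other extensions to build the released datasets.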

Files

Original bundle

Name: Xu_washington_0250O_28543.pdf
Size: 3.8 MB
Format: Adobe Portable Document Format