Magpie: Generating High-Quality Synthetic Data with Open-Source Large Language Models

dc.contributor.advisor: Poovendran, Radha
dc.contributor.author: Xu, Zhangchen
dc.date.accessioned: 2025-08-01T22:21:44Z
dc.date.available: 2025-08-01T22:21:44Z
dc.date.issued: 2025-08-01
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present Magpie, a self-synthesis method for generating large-scale alignment data. Our key observation is that aligned LLMs such as Llama-3-Instruct can generate a user query when we input only the pre-query template up to the position reserved for user messages, thanks to their auto-regressive nature (a minimal sketch of this prompting setup is given below the record). We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We further introduce extensions of Magpie for filtering and for generating multi-turn, preference-optimization, domain-specific, and multilingual datasets. We perform a comprehensive analysis of the Magpie-generated data. To compare Magpie-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using Magpie solely for supervised fine-tuning (SFT) can surpass models trained on previous public datasets used for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. We also show that on some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through SFT and subsequent preference optimization. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Xu_washington_0250O_28543.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53563
dc.language.iso: en_US
dc.rights: CC BY-NC
dc.subject: Alignment
dc.subject: Large Language Models
dc.subject: Post-Training
dc.subject: Synthetic Data Generation
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Electrical and computer engineering
dc.title: Magpie: Generating High-Quality Synthetic Data with Open-Source Large Language Models
dc.type: Thesis
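
The abstract above describes how Magpie elicits instructions by feeding an aligned model only the pre-query chat template and letting it auto-complete. The following is a minimal sketch of that idea using the Hugging Face transformers API and Llama-3's chat format; the model ID, sampling parameters, and two-pass structure are illustrative assumptions, not the thesis's exact pipeline, filtering, or scale.

# Minimal, illustrative sketch of Magpie-style self-synthesis prompting.
# Assumptions (not from the thesis): Hugging Face transformers, this model ID,
# and these sampling settings; the filtering and multi-turn extensions are
# not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template: the chat prefix up to the slot reserved for the user
# message. Because the model is auto-regressive, sampling a continuation from
# this prefix yields a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0, top_p=1.0)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

# Second pass: wrap the synthesized instruction in a normal chat turn to obtain
# the paired response.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
reply = model.generate(prompt_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
response = tokenizer.decode(reply[0][prompt_ids.shape[1]:], skip_special_tokens=True).strip()

print({"instruction": instruction, "response": response})

Repeating the two passes many times (with sampling enabled) yields a large pool of (instruction, response) pairs; the thesis then applies filtering and other extensions to build the released datasets.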

Files

Original bundle

Name: Xu_washington_0250O_28543.pdf
Size: 3.8 MB
Format: Adobe Portable Document Format