Data-Centric Methods for Decentralizing Large Language Models

dc.contributor.advisor: Smith, Noah A
dc.contributor.advisor: Zettlemoyer, Luke
dc.contributor.author: Gururangan, Suchin
dc.date.accessioned: 2024-04-26T23:19:24Z
dc.date.available: 2024-04-26T23:19:24Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Large language models (LMs) rely on massive textual datasets crawled from the Internet. In this thesis, I argue that many fundamental limitations of LMs (e.g., extreme costs, legal risks, and harmful behavior) are a direct result of monolithic, centralized, and homogeneous treatment of data. I first deconstruct the notion of a general-purpose corpus; I empirically show that current pretraining corpora implicitly favor text from the most powerful authors in society, and cannot feasibly represent all possible downstream use cases. Given this result, I highlight the importance of customizing LMs to new language variations using adaptive pretraining. I then propose a new class of LMs that are fundamentally decentralized, where components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. These new models address the limitations of centralization by being rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to these proposals are their data-centric nature; for example, I carefully explore what constitutes the domains to which experts specialize, and reflect on the data sources used to train LMs. I close by describing avenues for future work on decentralization techniques, with a focus on providing options for data opt-out, efficient customization, and cheaper scaling into massive, heterogeneous datasets.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Gururangan_washington_0250E_26513.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51332
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Language models
dc.subject: Natural Language Processing
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Data-Centric Methods for Decentralizing Large Language Models
dc.type: Thesis
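
The abstract describes an ensemble of independently trained domain experts that is sparse at inference time: only a few experts are active, weighted by how likely the incoming document is to belong to each expert's domain. Below is a minimal, self-contained sketch of that weighting idea, assuming each expert exposes a next-token log-probability function; the toy experts, the `ensemble` function, and all parameter names here are illustrative stand-ins, not the thesis's actual implementation.

```python
# Minimal sketch (hypothetical, not the thesis code) of sparse domain-expert
# ensembling: expert next-token distributions are mixed with weights given by
# a posterior over domains, and only the top-k experts stay active.
import numpy as np

VOCAB = 50_000  # toy vocabulary size


def dummy_expert(seed):
    """Stand-in for a trained domain-expert LM: maps a context to
    next-token log-probabilities over the vocabulary (ignores the
    context in this toy)."""
    rng = np.random.default_rng(seed)

    def next_token_logprobs(context_ids):
        logits = rng.normal(size=VOCAB)
        return logits - np.logaddexp.reduce(logits)  # log-softmax

    return next_token_logprobs


def ensemble(experts, context_loglikes, domain_log_prior, context_ids, top_k=2):
    """Sparse mixture of domain experts.

    context_loglikes: per-expert log-likelihood of the context so far,
        treated as evidence for the incoming document's domain.
    domain_log_prior: log prior over domains (e.g., uniform).
    """
    # Bayes' rule: p(domain | context) is proportional to
    # p(context | domain) * p(domain).
    log_post = np.asarray(context_loglikes) + np.asarray(domain_log_prior)
    log_post -= np.logaddexp.reduce(log_post)  # normalize

    # Sparsity: keep only the top_k most probable domains active, so
    # inference touches a few experts rather than all of them.
    active = np.argsort(log_post)[-top_k:]
    weights = np.exp(log_post[active])
    weights /= weights.sum()

    # Weighted mixture of the active experts' next-token distributions.
    probs = sum(w * np.exp(experts[i](context_ids)) for w, i in zip(weights, active))
    return np.log(probs)


experts = [dummy_expert(s) for s in range(4)]         # e.g., news, code, web, legal
uniform_prior = np.full(4, -np.log(4))                # no prior preference
evidence = np.array([-120.0, -95.0, -140.0, -100.0])  # toy context log-likelihoods
logp = ensemble(experts, evidence, uniform_prior, context_ids=[1, 2, 3])
print(logp.shape)  # (50000,) -- a full next-token distribution
```

Because the experts never communicate, adding or removing a domain amounts to editing the expert list and the prior, which is the "mix, add, or remove experts after training" property the abstract highlights.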

Files

Original bundle

Name: Gururangan_washington_0250E_26513.pdf
Size: 7.1 MB
Format: Adobe Portable Document Format