Data-Centric Methods for Decentralizing Large Language Models

dc.contributor.advisor: Smith, Noah A
dc.contributor.advisor: Zettlemoyer, Luke
dc.contributor.author: Gururangan, Suchin
dc.date.accessioned: 2024-04-26T23:19:24Z
dc.date.available: 2024-04-26T23:19:24Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Large language models (LMs) rely on massive textual datasets crawled from the Internet. In this thesis, I argue that many fundamental limitations of LMs (e.g., extreme costs, legal risks, and harmful behavior) are a direct result of monolithic, centralized, and homogeneous treatment of data. I first deconstruct the notion of a general-purpose corpus; I empirically show that current pretraining corpora implicitly favor text from the most powerful authors in society, and cannot feasibly represent all possible downstream use cases. Given this result, I highlight the importance of customizing LMs to new language variations using adaptive pretraining. I then propose a new class of LMs that are fundamentally decentralized, where components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. These new models address the limitations of centralization by being rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to these proposals are their data-centric nature; for example, I carefully explore what constitutes the domains to which experts specialize, and reflect on the data sources used to train LMs. I close by describing avenues for future work on decentralization techniques, with a focus on providing options for data opt-out, efficient customization, and cheaper scaling into massive, heterogeneous datasets.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Gururangan_washington_0250E_26513.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51332
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Language models
dc.subject: Natural Language Processing
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Data-Centric Methods for Decentralizing Large Language Models
dc.type: Thesis
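
The abstract describes an ensemble of independently trained domain experts that is sparse at inference time: only a few experts are active, weighted by how likely the incoming document is to belong to each expert's domain. Below is a minimal, self-contained sketch of that weighting idea, assuming each expert exposes a next-token log-probability function; the toy experts, the `ensemble` function, and all parameter names here are illustrative stand-ins, not the thesis's actual implementation.

```python
# Minimal sketch (hypothetical, not the thesis code) of sparse domain-expert
# ensembling: expert next-token distributions are mixed with weights given by
# a posterior over domains, and only the top-k experts stay active.
import numpy as np

VOCAB = 50_000  # toy vocabulary size


def dummy_expert(seed):
    """Stand-in for a trained domain-expert LM: maps a context to
    next-token log-probabilities over the vocabulary (ignores the
    context in this toy)."""
    rng = np.random.default_rng(seed)

    def next_token_logprobs(context_ids):
        logits = rng.normal(size=VOCAB)
        return logits - np.logaddexp.reduce(logits)  # log-softmax

    return next_token_logprobs


def ensemble(experts, context_loglikes, domain_log_prior, context_ids, top_k=2):
    """Sparse mixture of domain experts.

    context_loglikes: per-expert log-likelihood of the context so far,
        treated as evidence for the incoming document's domain.
    domain_log_prior: log prior over domains (e.g., uniform).
    """
    # Bayes' rule: p(domain | context) is proportional to
    # p(context | domain) * p(domain).
    log_post = np.asarray(context_loglikes) + np.asarray(domain_log_prior)
    log_post -= np.logaddexp.reduce(log_post)  # normalize

    # Sparsity: keep only the top_k most probable domains active, so
    # inference touches a few experts rather than all of them.
    active = np.argsort(log_post)[-top_k:]
    weights = np.exp(log_post[active])
    weights /= weights.sum()

    # Weighted mixture of the active experts' next-token distributions.
    probs = sum(w * np.exp(experts[i](context_ids)) for w, i in zip(weights, active))
    return np.log(probs)


experts = [dummy_expert(s) for s in range(4)]         # e.g., news, code, web, legal
uniform_prior = np.full(4, -np.log(4))                # no prior preference
evidence = np.array([-120.0, -95.0, -140.0, -100.0])  # toy context log-likelihoods
logp = ensemble(experts, evidence, uniform_prior, context_ids=[1, 2, 3])
print(logp.shape)  # (50000,) -- a full next-token distribution
```

Because the experts never communicate, adding or removing a domain amounts to editing the expert list and the prior, which is the "mix, add, or remove experts after training" property the abstract highlights.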

Files

Original bundle

Name: Gururangan_washington_0250E_26513.pdf
Size: 7.1 MB
Format: Adobe Portable Document Format