Data-Centric Methods for Decentralizing Large Language Models
| Field | Value |
| --- | --- |
| dc.contributor.advisor | Smith, Noah A. |
| dc.contributor.advisor | Zettlemoyer, Luke |
| dc.contributor.author | Gururangan, Suchin |
| dc.date.accessioned | 2024-04-26T23:19:24Z |
| dc.date.available | 2024-04-26T23:19:24Z |
| dc.date.issued | 2024-04-26 |
| dc.date.submitted | 2024 |
| dc.description | Thesis (Ph.D.)--University of Washington, 2024 |
| dc.description.abstract | Large language models (LMs) rely on massive textual datasets crawled from the Internet. In this thesis, I argue that many fundamental limitations of LMs (e.g., extreme costs, legal risks, and harmful behavior) are a direct result of monolithic, centralized, and homogeneous treatment of data. I first deconstruct the notion of a general-purpose corpus; I empirically show that current pretraining corpora implicitly favor text from the most powerful authors in society, and cannot feasibly represent all possible downstream use cases. Given this result, I highlight the importance of customizing LMs to new language variations using adaptive pretraining. I then propose a new class of LMs that are fundamentally decentralized, where components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. These new models address the limitations of centralization by being rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to these proposals is their data-centric nature; for example, I carefully explore what constitutes the domains to which experts specialize, and reflect on the data sources used to train LMs. I close by describing avenues for future work on decentralization techniques, with a focus on providing options for data opt-out, efficient customization, and cheaper scaling to massive, heterogeneous datasets. |
| dc.embargo.terms | Open Access |
| dc.format.mimetype | application/pdf |
| dc.identifier.other | Gururangan_washington_0250E_26513.pdf |
| dc.identifier.uri | http://hdl.handle.net/1773/51332 |
| dc.language.iso | en_US |
| dc.rights | CC BY |
| dc.subject | Language models |
| dc.subject | Natural Language Processing |
| dc.subject | Artificial intelligence |
| dc.subject.other | Computer science and engineering |
| dc.title | Data-Centric Methods for Decentralizing Large Language Models |
| dc.type | Thesis |
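The abstract above describes the core mechanism of the proposed decentralized LMs: each expert is trained only on documents from its own domain (so training is embarrassingly parallel), experts can be mixed, added, or removed after training, and inference keeps only a few experts active. The toy Python sketch below illustrates that routing logic under loose assumptions; every name in it (DomainExpert, ExpertPool, and the bigram counter standing in for a real neural LM) is hypothetical and not drawn from the thesis itself.

```python
# A minimal, illustrative sketch of the decentralized domain-expert design
# described in the abstract. All names here are hypothetical, not the
# thesis's actual code. Experts are tiny bigram counters standing in for
# full LMs; the point is the routing logic: each document updates only its
# own domain's expert (no cross-expert communication during training), and
# inference sparsely mixes a few experts.

from collections import Counter, defaultdict

class DomainExpert:
    """A stand-in 'LM': a smoothed bigram count model for one domain."""
    def __init__(self):
        self.bigrams = defaultdict(Counter)

    def update(self, tokens):
        # Conditional update: called only for documents in this expert's domain.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.bigrams[prev][nxt] += 1

    def prob(self, prev, nxt, vocab_size=1000):
        # Laplace-smoothed next-token probability.
        counts = self.bigrams[prev]
        return (counts[nxt] + 1) / (sum(counts.values()) + vocab_size)

class ExpertPool:
    """Pool of independent experts; they never communicate while training."""
    def __init__(self):
        self.experts = {}

    def train_document(self, domain, tokens):
        # Route the document to exactly one expert (embarrassingly parallel).
        self.experts.setdefault(domain, DomainExpert()).update(tokens)

    def remove(self, domain):
        # Rapid customization: drop a domain (e.g., after a data opt-out).
        self.experts.pop(domain, None)

    def mixture_prob(self, prev, nxt, weights):
        # Sparse inference: only experts with nonzero weight are active.
        return sum(w * self.experts[d].prob(prev, nxt)
                   for d, w in weights.items() if w > 0)

pool = ExpertPool()
pool.train_document("legal", "the court held that".split())
pool.train_document("medical", "the patient held steady".split())

# Mix two experts at inference with hand-set weights; a real system would
# estimate these from the input's domain posterior.
p = pool.mixture_prob("the", "court", {"legal": 0.8, "medical": 0.2})
print(f"p(court | the) under the mixture: {p:.4f}")
```

Because each expert is touched only by its own domain's documents, removing a domain (for instance, to honor a data opt-out) is a constant-time deletion rather than a retraining job, which is the customization property the abstract highlights.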
Files
Original bundle
- Name: Gururangan_washington_0250E_26513.pdf
- Size: 7.1 MB
- Format: Adobe Portable Document Format
