Accessible Foundation Models: Systems, Algorithms, and Science

Dettmers, Tim

dc.contributor.advisor	Zettlemoyer, Luke
dc.contributor.author	Dettmers, Tim
dc.date.accessioned	2024-09-09T23:06:24Z
dc.date.available	2024-09-09T23:06:24Z
dc.date.issued	2024-09-09
dc.date.submitted	2024
dc.description	Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract	The ever-increasing scale of foundation models, such as ChatGPT and AlphaFold, has revolutionized AI and science more generally. However, increasing scale also steadily raises computational barriers, blocking almost everyone from studying, adapting, or otherwise using these models for anything beyond static API queries. In this thesis, I will present research that significantly lowers these barriers for a wide range of use cases, including inference algorithms that are used to make predictions after training, finetuning approaches that adapt a trained model to new data, and finally, full training of foundation models from scratch. For inference, I will describe our LLM.int8() algorithm, which showed how to enable high-precision 8-bit matrix multiplication that is both fast and memory efficient. LLM.int8() is based on the discovery and characterization of sparse outlier sub-networks that only emerge at large model scales but are crucial for effective Int8 quantization. To maximize inference efficiency for devices with a limited memory footprint, I will present k-bit inference scaling laws, which empirically determine the quantization procedure that yields the highest performance density per bit for large language models (LLMs). The main finding of k-bit inference scaling laws is that, in almost all cases, 4-bit quantization is the most effective way of getting the best LLM performance for devices with limited memory. I will also discuss follow-up work, SpQR, which combines the insights about outlier structures from LLM.int8() and the performance-density-maximizing approach from k-bit inference scaling laws to achieve a quantization that replicates 16-bit performance with an average of 4.6 bits per parameter.
For finetuning, I will introduce the QLoRA algorithm, which pushes such quantization much further to unlock the finetuning of very large models on a single GPU by only updating a small set of the parameters while keeping most of the network in a new information-theoretically optimal 4-bit representation. For full training, I will present SWARM parallelism, which allows collaborative training of foundation models across continents on standard internet infrastructure while still being 80% as effective as the prohibitively expensive supercomputers that are currently used. Finally, I will close by outlining my plans to make future foundation models more accessible, which will be needed to maintain truly open AI-based scientific innovation as models continue to scale.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Dettmers_washington_0250E_26923.pdf
dc.identifier.uri	https://hdl.handle.net/1773/51870
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	deep learning
dc.subject	distributed pretraining
dc.subject	large language models
dc.subject	quantization
dc.subject	Computer science
dc.subject.other	Computer science and engineering
dc.title	Accessible Foundation Models: Systems, Algorithms, and Science
dc.type	Thesis

Files

Original bundle

Name: Dettmers_washington_0250E_26923.pdf
Size: 1.8 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering