Building the Next Generation of Multimodal Models

Authors

Ilharco, Gabriel

Abstract

One of the fundamental goals of machine learning is to create systems capable of processing data from a variety of modalities, such as images and text. I argue that the next generation of multimodal models will be enabled by a deeper understanding of how to design pretraining datasets, and by techniques that offer better control over models after pretraining. Towards the first goal, I introduce a fully open-source benchmark for designing multimodal datasets. This benchmark provides a shared experimental setting for research on dataset curation, allowing researchers to conduct rigorous and controlled experiments. Our experiments highlight the potential of careful empirical work on dataset curation, yielding pretraining datasets that outperform existing ones by a large margin. Towards the second goal, I present multiple techniques for improving models after pretraining. Our fine-tuning techniques improve accuracy without overspecialization and without increasing inference costs. Moreover, I present a modular framework for steering the behavior of trained models, designed to efficiently add or delete capabilities while operating directly in the models' weight space. Altogether, these new techniques pave the way for the next generation of multimodal models.
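The abstract does not spell out how weight-space steering works, but one common formulation of this idea is task-vector-style editing: take the difference between fine-tuned and pretrained weights, then add or subtract that difference (scaled) to add or delete a capability. The sketch below is an illustration under that assumption, using toy flattened weight vectors rather than the thesis's actual models; the names `base`, `finetuned`, and `alpha` are hypothetical.

```python
import numpy as np

# Hypothetical flattened weight vectors; real models have many more parameters.
base = np.array([0.5, -1.0, 2.0, 0.0])       # pretrained weights
finetuned = np.array([0.7, -0.8, 1.5, 0.3])  # weights after fine-tuning on a task

# A "task vector" is the element-wise difference between the fine-tuned
# and pretrained weights.
task_vector = finetuned - base

# Adding the scaled task vector steers the model toward the capability;
# subtracting it steers away from it (capability deletion).
alpha = 1.0
add_capability = base + alpha * task_vector     # equals `finetuned` at alpha = 1
delete_capability = base - alpha * task_vector  # moves opposite the task direction
```

Because the edit is a single vector addition in weight space, it changes neither the architecture nor the inference cost, and choosing `alpha` between 0 and 1 interpolates between the pretrained and fine-tuned models rather than fully committing to either.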

Description

Thesis (Ph.D.)--University of Washington, 2024
