Building the Next Generation of Multimodal Models

Authors

Ilharco, Gabriel

Abstract

One of the fundamental goals of machine learning is to create systems capable of processing data from a variety of modalities, such as images and text. I argue that the next generation of multimodal models will be enabled by a deeper understanding of how to design pretraining datasets, and by techniques that offer better control over models after pretraining. Towards the first goal, I introduce a fully open-source benchmark for designing multimodal datasets. This benchmark provides a shared experimental setting for research on dataset curation, allowing researchers to conduct rigorous and controlled experiments. Our experiments highlight the potential of careful empirical work on dataset curation, yielding pretraining datasets that outperform existing ones by a large margin. Towards the second goal, I present multiple techniques for improving models after pretraining. Our fine-tuning techniques improve accuracy without overspecialization and without increasing inference costs. Moreover, I present a modular framework for steering the behavior of trained models, designed to efficiently add or delete capabilities while operating directly in the models' weight space. Altogether, these new techniques pave the way for the next generation of multimodal models.
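The abstract does not spell out how weight-space steering works, but one common formulation of this idea is task-vector-style editing: take the difference between fine-tuned and pretrained weights, then add or subtract that difference (scaled) to add or delete a capability. The sketch below is an illustration under that assumption, using toy flattened weight vectors rather than the thesis's actual models; the names `base`, `finetuned`, and `alpha` are hypothetical.

```python
import numpy as np

# Hypothetical flattened weight vectors; real models have many more parameters.
base = np.array([0.5, -1.0, 2.0, 0.0])       # pretrained weights
finetuned = np.array([0.7, -0.8, 1.5, 0.3])  # weights after fine-tuning on a task

# A "task vector" is the element-wise difference between the fine-tuned
# and pretrained weights.
task_vector = finetuned - base

# Adding the scaled task vector steers the model toward the capability;
# subtracting it steers away from it (capability deletion).
alpha = 1.0
add_capability = base + alpha * task_vector     # equals `finetuned` at alpha = 1
delete_capability = base - alpha * task_vector  # moves opposite the task direction
```

Because the edit is a single vector addition in weight space, it changes neither the architecture nor the inference cost, and choosing `alpha` between 0 and 1 interpolates between the pretrained and fine-tuned models rather than fully committing to either.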

Description

Thesis (Ph.D.)--University of Washington, 2024
