How to train your self-supervised NLP model: Investigating pre-training objectives, data, and scale


Authors

Joshi, Mandar

Abstract

A robust language processing machine should be able to encode linguistic and factual knowledge across a wide variety of domains, languages, and even modalities. The paradigm of pre-training self-supervised models on large text corpora has driven much of the recent progress towards this goal. In spite of this large-scale pre-training, the best performing models have to be further fine-tuned on downstream tasks -- often containing hundreds of thousands of examples -- to achieve state-of-the-art performance. The aim of this thesis is twofold: (a) to design efficient, scalable pre-training methods which capture different kinds of linguistic and world knowledge, and (b) to enable better downstream performance with fewer human-labeled examples.

The first part of the thesis focuses on self-supervised objectives for reasoning about relationships between pairs of words. In natural language inference (NLI), for example, given the premise "golf is prohibitively expensive", inferring that the hypothesis "golf is a cheap pastime" is a contradiction requires one to know that expensive and cheap are antonyms. We show that, with the right kind of self-supervised objectives, such knowledge can be learned directly from text with word pair vectors (pair2vec), without using curated knowledge bases or ontologies.

The second part of the thesis seeks to build models which encode knowledge beyond word pair relations into model parameters. We present SpanBERT, a scalable pre-training method that is designed to better represent and predict spans of text. Span-based pre-training objectives seek to efficiently encode a wider variety of knowledge, and improve the state of the art for a range of NLP tasks.

The third part of the thesis focuses on integrating dynamically retrieved textual knowledge. Specifically, even large-scale representations are not able to preserve all the factual knowledge they have "read" during pre-training, due to the long tail of entity- and event-specific information. We show that training models to integrate background knowledge during pre-training is especially useful for downstream tasks which require reasoning over this long tail.

The last part of the thesis targets a major weakness of self-supervised models -- while such models require no explicit human supervision during pre-training, they still need large amounts of human-labeled downstream task data. We seek to remedy this by mining input-output pairs (and thus obtaining direct task-level supervision) from corpora, using supervision from very few labeled examples.

Overall, this thesis presents a range of ideas required for effective pre-training and fine-tuning -- (a) self-supervised objectives, (b) model scale, and (c) new types of data. As we will show in the following chapters, self-supervised objectives can have a large influence on the form of knowledge that is acquired during pre-training. Moreover, efficient objectives directly enable model scale, both in terms of data and parameters. Finally, the training data, and the kind of supervision derived from it, dictate how well a model can learn different kinds of downstream tasks.
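The span-based objective mentioned above can be sketched in a few lines. This is a minimal illustrative sketch, not the thesis's implementation: SpanBERT samples span lengths geometrically and predicts span tokens from boundary representations, whereas here a fixed-length contiguous span is simply masked and returned as the prediction target (the function name and `span_length` parameter are illustrative choices, not from the source).

```python
import random

def mask_contiguous_span(tokens, span_length, mask_token="[MASK]"):
    """Mask one contiguous span of tokens, span-prediction style:
    the pre-training objective would then require the model to
    reconstruct the original span from the surrounding context."""
    # Pick a random start position such that the span fits.
    start = random.randrange(0, len(tokens) - span_length + 1)
    masked = list(tokens)
    target = tokens[start:start + span_length]  # tokens to predict
    for i in range(start, start + span_length):
        masked[i] = mask_token
    return masked, start, target

tokens = "an American football game".split()
random.seed(0)
masked, start, target = mask_contiguous_span(tokens, 2)
# e.g. masked might be ["an", "[MASK]", "[MASK]", "game"],
# with target ["American", "football"]
```

Masking whole spans, rather than independent single tokens, forces the model to use longer-range context, which is what lets span-based objectives encode a wider variety of knowledge.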

Description

Thesis (Ph.D.)--University of Washington, 2022
