Complementing Scale: Novel Guidance Methods for Improving Language Models

Authors

Press, Ofir

Abstract

Language models (LMs) are at the core of almost all state-of-the-art natural language processing systems. Recent work, such as Brown et al. [2020] and Hoffmann et al. [2022], has shown that scaling up the size of these models leads to better results on the conventional benchmarks used by the community. But is scaling all we need in order to improve language models? Here, we show that some properties of LMs do not improve with scale. In addition, we show how to tackle these issues without increasing the LM's size on disk, memory usage, or runtime. We accomplish this by adding a new kind of guidance to the model.

In Shortformer, we show that increasing the training input sequence length in transformers does not always lead to better perplexity. We propose a new method that trains LMs on shorter sequences for the majority of training before briefly training on longer ones, and we show that this improves performance.

Memory constraints imply that LMs must be trained on limited segments of text. For example, GPT-3 [Brown et al. 2020] was trained on text segments that are 2,048 tokens long. Can these models summarize text sequences longer than the ones they observed at training? Can they make code predictions for code files longer than the ones they were shown during training? Here, we show that existing LMs cannot process text segments longer than the ones they were trained on. We present a new method, ALiBi, that allows LMs to efficiently consume sequences longer than the ones they observed at training. ALiBi achieves this by guiding the LM to pay less attention to words that are further away.

Finally, we show that LMs are able to reason over facts observed during training to answer novel questions that they have never previously seen. But in about 40% of cases, they cannot accomplish basic reasoning over facts they are able to recall, and this does not improve with scale.
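The distance penalty that ALiBi adds to attention scores, described above, can be sketched in a few lines. This is a minimal NumPy illustration written for this page, not code from the thesis; the function names are ours, and the head slopes follow the geometric sequence 2^(-8(h+1)/H) used in the ALiBi paper.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Linear distance penalties added to attention scores (ALiBi sketch).

    Head h gets slope m_h = 2**(-8 * (h + 1) / num_heads); the bias for
    query i attending to key j (j <= i) is -m_h * (i - j), so more distant
    tokens receive a larger penalty. Future positions are masked to -inf.
    """
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads)
                       for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]            # (i - j), shape (L, L)
    bias = -slopes[:, None, None] * dist          # shape (H, L, L)
    return np.where(dist >= 0, bias, -np.inf)     # causal mask

def attention_with_alibi(q, k, v, bias):
    """Scaled dot-product attention with the ALiBi bias added pre-softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias depends only on relative distance, it needs no learned position embeddings and can be computed for any sequence length at inference time, which is what lets the model consume sequences longer than those seen in training.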
We show that by adding guidance to the way we prompt LMs, having them ask and answer sub-questions before answering the main complex question (as in our self-ask prompt), we can substantially improve their reasoning capabilities.
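The self-ask prompt described above works by prepending a worked demonstration in which a compositional question is decomposed into explicit sub-questions. The snippet below is our own minimal illustration of that structure (the helper name and exact wording are assumptions, not code or text from the thesis).

```python
# One worked demonstration: the model is shown how to decompose a
# compositional question into follow-up sub-questions before answering.
SELF_ASK_EXAMPLE = """\
Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft
"""

def build_self_ask_prompt(question):
    """Prepend the demonstration so the LM imitates the decomposition,
    then leave the cursor right after the decomposition trigger."""
    return (SELF_ASK_EXAMPLE
            + f"\nQuestion: {question}"
            + "\nAre follow up questions needed here:")
```

The model's continuation then produces its own "Follow up:" / "Intermediate answer:" pairs before committing to a final answer, which is where the reasoning improvement comes from.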

Description

Thesis (Ph.D.)--University of Washington, 2023
