Advisor: Steinert-Threlkeld, Shane
Author: Cheng, Yao-Fei
Date: 2025-08-01
Year: 2025
File: Cheng_washington_0250O_28270.pdf
URI: https://hdl.handle.net/1773/53677
Description: Thesis (Master's)--University of Washington, 2025

Abstract: Transformer-based large language models have driven substantial progress in NLP. However, transformers struggle with length generalization, i.e., extrapolating to input lengths not seen during training. Recently, Zhou et al. (2024a) proposed the RASP-Generalization Conjecture, which predicts which tasks are length-generalizable, based on a few carefully handcrafted tasks in the mathematical domain. This work examines the conjecture by generating hundreds of synthetic tasks, each expressed as a shortest RASP-L program. Our investigation does not support the conjecture: tasks written as the shortest RASP-L programs are not length-generalizable. Furthermore, our analysis reveals that some failures of length generalization arise because models do not stop generating; this can be easily fixed by using the oracle length during evaluation, as suggested in previous literature. Additionally, the analysis rejects induction heads as the key factor behind failures of length generalization, contrary to previous findings. Although our work does not provide a precise explanation of transformers' length-generalization capability, we show that previous claims do not extend to tasks beyond the carefully handcrafted ones.

Format: application/pdf
Language: en-US
Rights: CC BY-NC
Keywords: Large Language Models; Length-Generalization; Transformers; Linguistics
Subjects: Computer science; Linguistics
Title: Beyond Memorization: Evaluating Length-Generalization in Transformer-based Language Models
Type: Thesis