Bridging the Gap: Adaptation Approaches for Under-Resourced Language Families
Abstract
Multilingual large language models have demonstrated remarkable success across a variety of natural language processing (NLP) tasks. However, their performance on under-resourced languages remains significantly limited, primarily due to disparities in data availability. This thesis investigates adaptation strategies to improve multilingual model performance on low-resource languages. Focusing on the Turkic language family, we study whether adapting a pre-trained model with data from related languages improves performance. We evaluate language-family-specific adaptation techniques, including language-adaptive pre-training (LAPT) and vocabulary specialization, in both zero-shot and few-shot settings. Our results highlight the potential of targeted multilingual adaptation to bridge performance gaps in low-resource settings and reinforce best practices for multilingual model adaptation.
Description
Thesis (Master's)--University of Washington, 2025
