Advisors: Levow, Gina-Anne; Steinert-Threlkeld, Shane
Author: Downey, C.M.
Date issued: 2024-09-09
File: Downey_washington_0250E_27212.pdf
URI: https://hdl.handle.net/1773/52073
Description: Thesis (Ph.D.)--University of Washington, 2024

Abstract: Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the large digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages: conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other.

Format: application/pdf
Language: en-US
Rights: CC BY-SA
Subjects: Linguistics; Computer science
Title: Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing
Type: Thesis