Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.advisor: Steinert-Threlkeld, Shane
dc.contributor.author: Downey, C.M.
dc.date.accessioned: 2024-09-09T23:12:02Z
dc.date.available: 2024-09-09T23:12:02Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the large digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages: conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad, guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Downey_washington_0250E_27212.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52073
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: Linguistics
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing
dc.type: Thesis

Files

Original bundle

Name: Downey_washington_0250E_27212.pdf
Size: 5.49 MB
Format: Adobe Portable Document Format