Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.advisor: Steinert-Threlkeld, Shane
dc.contributor.author: Downey, C.M.
dc.date.accessioned: 2024-09-09T23:12:02Z
dc.date.available: 2024-09-09T23:12:02Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the large digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages: conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad, guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Downey_washington_0250E_27212.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52073
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: Linguistics
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing
dc.type: Thesis

Files

Original bundle

Name: Downey_washington_0250E_27212.pdf
Size: 5.49 MB
Format: Adobe Portable Document Format