Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing
| dc.contributor.advisor | Levow, Gina-Anne | |
| dc.contributor.advisor | Steinert-Threlkeld, Shane | |
| dc.contributor.author | Downey, C.M. | |
| dc.date.accessioned | 2024-09-09T23:12:02Z | |
| dc.date.available | 2024-09-09T23:12:02Z | |
| dc.date.issued | 2024-09-09 | |
| dc.date.submitted | 2024 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2024 | |
| dc.description.abstract | Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the massive digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages: conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad, guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Downey_washington_0250E_27212.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/52073 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY-SA | |
| dc.subject | Linguistics | |
| dc.subject | Computer science | |
| dc.subject.other | Linguistics | |
| dc.title | Adapting Pre-Trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing | |
| dc.type | Thesis |
Files
Original bundle (1 of 1)
- Name: Downey_washington_0250E_27212.pdf
- Size: 5.49 MB
- Format: Adobe Portable Document Format
