Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing

Gokcen, Ajda

dc.contributor.advisor	Levow, Gina-Anne
dc.contributor.author	Gokcen, Ajda
dc.date.accessioned	2022-01-26T23:25:27Z
dc.date.available	2022-01-26T23:25:27Z
dc.date.issued	2022-01-26
dc.date.submitted	2021
dc.description	Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract	Multilingual modeling comes up in natural language processing at any scale. High-resource language corpora train high-performing models, and can be combined with other language corpora of all sizes to make better models for low-resource languages. Projects like Universal Dependencies even make it possible to train highly multilingual models from standardized morphosyntactic labels. Multilingual (or, more generally, multi-source) training does not consistently improve modeling performance, however. With an abundance of language resources comes a difficult design choice: which corpora will train better together rather than separately? More specifically, when is it worthwhile to supplement (i.e., concatenate) one corpus with another during training, rather than training on the first corpus alone? Approaches to selecting and evaluating candidate combinations tend toward two extremes: ad hoc or exhaustive. In this work, I put forth an alternative, predictive methodology for outcomes of concatenative training in dependency parsing. I leverage treebanks from the Universal Dependencies framework to assess the utility of linguistic corpus metrics in multi-source modeling. This approach is both robust and practical, using computationally simple metrics that expand upon intuitions of linguistic similarity, and making it possible to reasonably predict which conditions will yield significant improvement for a target corpus. Although the results are specific to a particular family of models and the task of dependency parsing, the approach holds promise for any number of natural language processing applications.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Gokcen_washington_0250E_23778.pdf
dc.identifier.uri	http://hdl.handle.net/1773/48283
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Computational Linguistics
dc.subject	Corpus Linguistics
dc.subject	Dependency Parsing
dc.subject	Multilingual Modeling
dc.subject	Multitask Modeling
dc.subject	Natural Language Processing
dc.subject	Linguistics
dc.subject	Computer science
dc.subject.other	Linguistics
dc.title	Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing
dc.type	Thesis

Files

Original bundle

Name: Gokcen_washington_0250E_23778.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format

Collections

Linguistics