Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.author: Gokcen, Ajda
dc.date.accessioned: 2022-01-26T23:25:27Z
dc.date.available: 2022-01-26T23:25:27Z
dc.date.issued: 2022-01-26
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: Multilingual modeling comes up in natural language processing at any scale. High-resource language corpora train high-performing models, and can be combined with other language corpora of all sizes to make better models for low-resource languages. Projects like Universal Dependencies even make it possible to train highly multilingual models from standardized morphosyntactic labels. Multilingual (or, more generally, multi-source) training does not consistently improve modeling performance, however. With an abundance of language resources comes a difficult design choice: which corpora will train better together rather than separately? More specifically, when is it worthwhile to supplement (i.e., concatenate) one corpus with another during training, rather than training on the first corpus alone? Approaches to selecting and evaluating candidate combinations tend toward two extremes: ad hoc or exhaustive. In this work, I put forth an alternative, predictive methodology for outcomes of concatenative training in dependency parsing. I leverage treebanks from the Universal Dependencies framework to assess the utility of linguistic corpus metrics in multi-source modeling. This approach is both robust and practical, using computationally simple metrics that expand upon intuitions of linguistic similarity, and making it possible to reasonably predict which conditions will yield significant improvement for a target corpus. Although the results are specific to a particular family of models and the task of dependency parsing, the approach holds promise for any number of natural language processing applications.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Gokcen_washington_0250E_23778.pdf
dc.identifier.uri: http://hdl.handle.net/1773/48283
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Computational Linguistics
dc.subject: Corpus Linguistics
dc.subject: Dependency Parsing
dc.subject: Multilingual Modeling
dc.subject: Multitask Modeling
dc.subject: Natural Language Processing
dc.subject: Linguistics
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing
dc.type: Thesis

Files

Original bundle

Name: Gokcen_washington_0250E_23778.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format
