Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing

Gokcen, Ajda

Files

Gokcen_washington_0250E_23778.pdf (1.19 MB)

Date

2022-01-26

Author

Gokcen, Ajda

Abstract

Multilingual modeling comes up in natural language processing at any scale. High-resource language corpora train high-performing models, and can be combined with other language corpora of all sizes to make better models for low-resource languages. Projects like Universal Dependencies even make it possible to train highly multilingual models from standardized morphosyntactic labels. Multilingual (or, more generally, multi-source) training does not consistently improve modeling performance, however. With an abundance of language resources comes a difficult design choice: which corpora will train better together rather than separately? More specifically, when is it worthwhile to supplement (i.e., concatenate) one corpus with another during training, rather than training on the first corpus alone? Approaches to selecting and evaluating candidate combinations tend toward two extremes: ad hoc or exhaustive. In this work, I put forth an alternative, predictive methodology for outcomes of concatenative training in dependency parsing. I leverage treebanks from the Universal Dependencies framework to assess the utility of linguistic corpus metrics in multi-source modeling. This approach is both robust and practical, using computationally simple metrics that expand upon intuitions of linguistic similarity, and making it possible to reasonably predict which conditions will yield significant improvement for a target corpus. Although the results are specific to a particular family of models and the task of dependency parsing, the approach holds promise for any number of natural language processing applications.
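
As a rough illustration of what a "computationally simple" corpus metric could look like, the sketch below compares two treebanks by the distribution of their dependency relation labels. The abstract does not name the metrics used in the dissertation, so the choice of Jensen-Shannon divergence over label frequencies is an assumption made here for illustration, as are the function names and the toy sentences. The only thing taken from Universal Dependencies is the CoNLL-U layout: ten tab-separated columns, with the dependency relation in the eighth.

```python
import math
from collections import Counter


def deprel_distribution(conllu_text):
    """Relative frequency of each dependency relation (CoNLL-U column 8)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank separators and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a, m):
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_conllu(tokens):
    """Hypothetical helper: build one CoNLL-U sentence from
    (form, upos, head, deprel) tuples, leaving other columns as "_"."""
    return "\n".join(
        "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])
        for i, (form, upos, head, deprel) in enumerate(tokens, start=1)
    )


# Toy stand-ins for a target treebank and a candidate supplement.
target = toy_conllu([("She", "PRON", 2, "nsubj"),
                     ("reads", "VERB", 0, "root"),
                     ("books", "NOUN", 2, "obj")])
candidate = toy_conllu([("Dogs", "NOUN", 2, "nsubj"),
                        ("bark", "VERB", 0, "root")])

score = js_divergence(deprel_distribution(target),
                      deprel_distribution(candidate))
print(f"label-distribution divergence: {score:.3f}")
```

In a predictive setup of the kind the abstract describes, a score like this would be one feature among several for estimating whether concatenating the candidate corpus is likely to help the target; whether lower divergence actually corresponds to improvement is an empirical question that this sketch does not address.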