Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages

dc.contributor.advisor: Xia, Fei
dc.contributor.author: Jaja, Claire
dc.date.accessioned: 2015-02-24T17:36:29Z
dc.date.available: 2015-02-24T17:36:29Z
dc.date.issued: 2015-02-24
dc.date.submitted: 2014
dc.description: Thesis (Master's)--University of Washington, 2014
dc.description.abstract: Dependency parsing is an important natural language processing (NLP) task with many downstream applications, and as is common in the field, high-accuracy results can be obtained by using statistical methods and training on high-quality annotated training data. When dealing with low-resource languages, where annotated training data is not readily available and is prohibitively expensive to obtain, more sophisticated methods must be used to leverage existing resources. My work in this thesis focuses on instance selection, which rests on the assumption, well proven monolingually in domain adaptation but little explored cross-linguistically, that using a smaller amount of training data that is more relevant to the test case is better than using a full pool of potentially highly irrelevant training data. I conduct a larger, more thorough exploration than has previously been attempted into instance selection based on the perplexity of part-of-speech tag sequences, using the Google Universal Dependency Treebank, which spans ten languages. Additionally, I leverage another instance selection technique based on cross-entropy difference, which has shown superior results to perplexity selection when used for domain adaptation. Both methods are applied to two different potential pools of training data: one the combination of multiple source languages, the other English alone. Lastly, I explore automatic rearrangement of the part-of-speech tags in the English training data to better match three potential target languages. These experiments show mixed results, which may help to inform future exploration in dependency parsing for low-resource languages. When a pool of multiple source languages is used, a significant boost is seen for target languages whose relevant training data is present but infrequent in the pool, with cross-entropy difference providing slightly better performance than perplexity selection.
However, these methods do not provide the same large improvements for target languages where abundant relevant training data is available among the multiple source languages, or when English alone is used as the training data. Rearranging the part-of-speech tags has a small positive impact on the scores when using the entire training dataset, which is promising for more extensive rearrangement. However, applying instance selection methods to this rearranged data does not yield better results than selecting training data from the non-rearranged data.
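The cross-entropy difference criterion mentioned in the abstract (commonly attributed to Moore and Lewis) can be sketched as follows. This is a minimal illustration, not the thesis's exact setup: the unigram model, add-one smoothing, and toy POS-tag sequences are all assumptions made for the sake of a self-contained example.

```python
import math
from collections import Counter

def train_unigram(sents):
    """Unigram tag counts over a list of POS-tag sequences."""
    counts = Counter(tag for sent in sents for tag in sent)
    return counts, sum(counts.values())

def cross_entropy(sent, model, vocab_size):
    """Per-tag cross-entropy (bits) under an add-one-smoothed unigram model."""
    counts, total = model
    return -sum(math.log2((counts[tag] + 1) / (total + vocab_size))
                for tag in sent) / len(sent)

def ce_difference_ranking(pool, target_model, pool_model, vocab_size):
    """Score each pool sentence by H_target(s) - H_pool(s).

    Lower scores mean the sentence looks more like the target language's
    tag distribution and less like the generic pool; instance selection
    keeps the lowest-scoring sentences.
    """
    scored = [(cross_entropy(s, target_model, vocab_size)
               - cross_entropy(s, pool_model, vocab_size), s)
              for s in pool]
    return sorted(scored, key=lambda x: x[0])

# Toy POS-tag sequences (hypothetical, for illustration only).
target = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "NOUN"]]
pool = [["ADJ", "ADJ", "ADJ"], ["DET", "NOUN", "VERB"], ["PRON", "VERB", "ADV"]]

vocab = {t for s in target + pool for t in s}
ranking = ce_difference_ranking(pool, train_unigram(target),
                                train_unigram(pool), len(vocab))
```

Perplexity-based selection corresponds to ranking by the target-model term alone (perplexity is simply 2 raised to the cross-entropy), which is why the two methods are natural points of comparison.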
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Jaja_washington_0250O_13894.pdf
dc.identifier.uri: http://hdl.handle.net/1773/27514
dc.language.iso: en_US
dc.rights: Copyright is held by the individual authors.
dc.subject: cross-entropy difference; dependency parsing; domain adaptation; instance selection; low-resource languages; perplexity
dc.subject.other: Computer science
dc.subject.other: Linguistics
dc.title: Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages
dc.type: Thesis

Files

Name: Jaja_washington_0250O_13894.pdf
Size: 545.91 KB
Format: Adobe Portable Document Format
