Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages

dc.contributor.advisor: Xia, Fei
dc.contributor.author: Jaja, Claire
dc.date.accessioned: 2015-02-24T17:36:29Z
dc.date.available: 2015-02-24T17:36:29Z
dc.date.issued: 2015-02-24
dc.date.submitted: 2014
dc.description: Thesis (Master's)--University of Washington, 2014
dc.description.abstract: Dependency parsing is an important natural language processing (NLP) task with many downstream applications, and as is common in the field, high-accuracy results can be obtained by using statistical methods and training on high-quality annotated training data. When dealing with low-resource languages, where annotated training data is not readily available and is prohibitively expensive to obtain, more sophisticated methods must be used to leverage existing resources. My work in this thesis focuses on instance selection, which rests on the assumption, well proven monolingually in domain adaptation but little explored cross-linguistically, that using a smaller amount of training data that is more relevant to the test case is better than using a full pool of potentially highly irrelevant training data. I conduct a larger, more thorough exploration than has previously been attempted into instance selection based on the perplexity of part-of-speech tag sequences, using the Google Universal Dependency Treebank, which spans ten languages. Additionally, I leverage another instance selection technique based on cross-entropy difference, which has shown superior results to perplexity selection when used for domain adaptation. Both methods are applied to two different potential pools of training data: one the combination of multiple source languages, the other English alone. Lastly, I explore automatic rearrangement of the part-of-speech tags in the English training data to better match three potential target languages. These experiments show mixed results, which may help to inform future exploration in dependency parsing for low-resource languages. When a pool of multiple source languages is used, a significant boost is seen for target languages whose relevant training data is present but infrequent in the pool, with cross-entropy difference providing slightly better performance than perplexity selection.
However, these methods do not provide the same large improvements for target languages where abundant relevant training data is available among the multiple source languages, or when English alone is used as the training data. Rearranging the part-of-speech tags has a small positive impact on the scores when using the entire training dataset, which is promising for more extensive rearrangement. However, applying instance selection methods to this rearranged data does not yield better results than selecting training data from the non-rearranged data.
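The cross-entropy difference criterion mentioned in the abstract (commonly attributed to Moore and Lewis) can be sketched as follows. This is a minimal illustration, not the thesis's exact setup: the unigram model, add-one smoothing, and toy POS-tag sequences are all assumptions made for the sake of a self-contained example.

```python
import math
from collections import Counter

def train_unigram(sents):
    """Unigram tag counts over a list of POS-tag sequences."""
    counts = Counter(tag for sent in sents for tag in sent)
    return counts, sum(counts.values())

def cross_entropy(sent, model, vocab_size):
    """Per-tag cross-entropy (bits) under an add-one-smoothed unigram model."""
    counts, total = model
    return -sum(math.log2((counts[tag] + 1) / (total + vocab_size))
                for tag in sent) / len(sent)

def ce_difference_ranking(pool, target_model, pool_model, vocab_size):
    """Score each pool sentence by H_target(s) - H_pool(s).

    Lower scores mean the sentence looks more like the target language's
    tag distribution and less like the generic pool; instance selection
    keeps the lowest-scoring sentences.
    """
    scored = [(cross_entropy(s, target_model, vocab_size)
               - cross_entropy(s, pool_model, vocab_size), s)
              for s in pool]
    return sorted(scored, key=lambda x: x[0])

# Toy POS-tag sequences (hypothetical, for illustration only).
target = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "NOUN"]]
pool = [["ADJ", "ADJ", "ADJ"], ["DET", "NOUN", "VERB"], ["PRON", "VERB", "ADV"]]

vocab = {t for s in target + pool for t in s}
ranking = ce_difference_ranking(pool, train_unigram(target),
                                train_unigram(pool), len(vocab))
```

Perplexity-based selection corresponds to ranking by the target-model term alone (perplexity is simply 2 raised to the cross-entropy), which is why the two methods are natural points of comparison.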
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Jaja_washington_0250O_13894.pdf
dc.identifier.uri: http://hdl.handle.net/1773/27514
dc.language.iso: en_US
dc.rights: Copyright is held by the individual authors.
dc.subject: cross-entropy difference; dependency parsing; domain adaptation; instance selection; low-resource languages; perplexity
dc.subject.other: Computer science
dc.subject.other: Linguistics
dc.title: Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages
dc.type: Thesis

Files

Name: Jaja_washington_0250O_13894.pdf
Size: 545.91 KB
Format: Adobe Portable Document Format
