Parallel Sentence Detection in Comparable Corpora with Bilingual Word Embeddings for Low-resource Languages

Author: Cadigan, John
Contributor: Marton, Yuval
Institution: My University
Department: Linguistics
Date: 2019-05-02
Keywords: BUCC; CNN; comparable corpora; low resource; parallel sentence; ResNet
Subjects: Artificial intelligence; Linguistics; Computer science
Language: en-US
Type: Thesis
File: Cadigan_washington_0250O_19431.pdf (application/pdf)
URI: http://hdl.handle.net/1773/43705
License: CC BY
Note: Thesis (Master's)--University of Washington, 2018

In an emergency, machine translation systems can facilitate international cooperation during rescue efforts. Unfortunately, training resources (bitexts) for many language pairs are scarce and tend to cover a limited vocabulary, which results in low-quality translation. Finding parallel sentences in comparable corpora is a quick and cheap way to augment bitext and so expand the vocabulary of the translation model. However, out-of-vocabulary words are also a problem for parallel sentence detection in comparable corpora: when features are derived from existing translation models, detecting parallel sentences becomes more challenging in new domains with unseen vocabulary, because non-translations and unseen translations are hard to distinguish. Bilingual word embeddings have been recognized as a solution to this problem because they allow the more plentiful monolingual text to be used to derive word representations that approximate a translation model. This study quantifies the role of contemporary bilingual word embedding methods in extracting parallel sentences from comparable corpora in low-resource settings. The first dataset is a simulated low-resource Chinese-English dataset from the BUCC 2017-2018 comparable corpus shared task.
Using the methods established by the BUCC shared task, a second synthetic comparable corpus is created with a true low-resource pair composed of data used during an emergency: Haitian Creole and English. With limited resources, the languages on both sides of the comparable corpora are representative of low-resource settings. The embedding methods are first evaluated in bilingual lexicon induction, and the best-performing ones are applied in two downstream tasks. First, they are used to filter the very large number of candidate parallel sentences in the comparable corpora. Second, they are used to classify those candidate pairs as parallel or non-parallel. The key contributions are as follows. First, the performance advantage of character-based embeddings over other word-based monolingual embedding methods for bilingual lexicon induction is confirmed, particularly for rare words; these words come from the top 30% of the corpora appearing in this study, which is approximately 15 times the (percentile) range of vocabulary examined in other studies' experiments on their respective corpora. Second, classic and new methods for filtering candidate pairs are compared and quantified on the same datasets, to the best of my knowledge for the first time. A retrieval rate of 95% is possible even in low-resource settings. Adding bilingual word embeddings to candidate filtering did not yield gains in retrieval rate, but it did improve results during classification. For classification, a novel architecture for parallel sentence detection is presented: an extensible 2D residual convolutional neural network. Compared to previous Siamese RNN architectures for this task, it effectively incorporates features derived from monolingual data. With an optimized cutoff, the ResNet is nearly competitive with the best systems from the 2018 edition of the BUCC shared task on the Chinese-English dataset while using less data.
Comparisons with a MaxEnt model indicate that the 2D ResNet's explicitly syntactic representation may be a factor in making the most of limited resources. In addition to the new model, a new matching heuristic is applied to the results of the classifiers; it consistently trades a small amount of recall for gains in precision, yielding an overall increase in F1 score. The positive results for the Haitian Creole-English dataset, which is truly representative of what is available for low-resource languages during an emergency, provide initial evidence that the methods may also be effective for other low-resource languages.
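
The abstract does not describe the matching heuristic itself, so the following is only an illustrative sketch of how a matching step over classifier scores can trade recall for precision. It assumes a greedy one-to-one constraint (each sentence may appear in at most one accepted pair), which is not necessarily the heuristic used in the thesis. The function names, sentence IDs, scores, and the cutoff of 0.5 are all hypothetical toy values, not results from the study.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]
Scored = Tuple[str, str, float]  # (source id, target id, classifier score)


def threshold_only(scored: Iterable[Scored], cutoff: float) -> List[Pair]:
    """Accept every candidate pair whose score reaches the cutoff."""
    return [(s, t) for s, t, score in scored if score >= cutoff]


def greedy_one_to_one(scored: Iterable[Scored], cutoff: float) -> List[Pair]:
    """Assumed heuristic: visit pairs by descending score and keep a pair
    only if neither of its sentences has already been matched."""
    used_src: Set[str] = set()
    used_tgt: Set[str] = set()
    kept: List[Pair] = []
    for s, t, score in sorted(scored, key=lambda x: x[2], reverse=True):
        if score < cutoff:
            break  # all remaining pairs score below the cutoff
        if s in used_src or t in used_tgt:
            continue  # one side is already matched to a better-scoring pair
        used_src.add(s)
        used_tgt.add(t)
        kept.append((s, t))
    return kept


def precision_recall_f1(predicted: Iterable[Pair], gold: Iterable[Pair]):
    """Standard precision, recall, and F1 over sets of sentence pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical toy data: five true pairs and four false candidates.
gold = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s5", "t5"), ("s6", "t6")}
scored = [
    ("s1", "t1", 0.95), ("s1", "t2", 0.70), ("s2", "t2", 0.90),
    ("s3", "t3", 0.80), ("s3", "t4", 0.60), ("s4", "t3", 0.55),
    ("s5", "t5", 0.52), ("s6", "t7", 0.85), ("s6", "t6", 0.65),
]

before = precision_recall_f1(threshold_only(scored, 0.5), gold)
after = precision_recall_f1(greedy_one_to_one(scored, 0.5), gold)
print("threshold only: P=%.3f R=%.3f F1=%.3f" % before)
print("with matching:  P=%.3f R=%.3f F1=%.3f" % after)
```

On this toy input, the matching step discards the true pair (s6, t6) because a higher-scoring false pair has already claimed s6, so recall falls while precision and F1 rise. The numbers only illustrate the direction of the trade-off described above and say nothing about its size on real data.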