Polyglot Text Classification with Neural Document Models

Gururangan, Suchin

Files

Gururangan_washington_0250O_19274.pdf (635.98 KB)

Date

2018-11-28

Abstract

Annotating data for text classification is often expensive, so one must rely on techniques like parameter sharing and semi-supervised learning to improve classification performance in low-resource environments. In this thesis, I combine a generative neural document model (Card et al., 2018) and multilingual word vectors (Ammar et al., 2016) to perform text classification on documents in eight languages. The model I propose trains jointly on labeled and unlabeled data from multiple languages and incorporates additional document-level metadata, such as language ID, in its generative story. Through a series of experiments, I show that the model significantly outperforms monolingual baselines in low-resource environments.