Dependency Parsing for Tweets
This thesis addresses the problem of dependency parsing for Twitter texts. Twitter texts, also called tweets, are a typical kind of web-domain language with many informal and platform-specific linguistic phenomena (Eisenstein, 2013), and they are drawing increasing attention in NLP research. Although parsing algorithms have made substantial progress on newswire text in recent years, parsers trained directly on newswire data struggle to achieve comparable results on tweets (Foster et al., 2011a). We therefore tackle the problem from two aspects: data and model. On the data side, we discuss the Twitter-specific linguistic phenomena that pose challenges for annotating tweet dependencies, and we take them into account in our annotation formalism. We create a new development set of 210 tweets for the first tweet dependency treebank, Tweebank (Kong et al., 2014). On the model side, we propose the neural tweet parser, a novel neural dependency parser for tweets. We extend the stack LSTM parser (Dyer et al., 2015) and incorporate character embeddings (Ballesteros et al., 2015) into our word representations. To increase the scale of the training data, we further explore two kinds of external data: out-of-domain data, through a cascading model based on pre-training, and unannotated in-domain data, through tri-training. Experimental results show that our neural tweet parser is over 15 times faster than Tweeboparser (Kong et al., 2014), the previous state-of-the-art parser for tweets. Our parser also benefits from both types of external data, and when trained with the additional tri-training data, it outperforms Tweeboparser.
- Linguistics