Extracting Knowledge from Twitter and The Web
Ritter, Alan L.
The internet has revolutionized the way we communicate, leading to a constant flood of text in electronic format, including the Web, email, SMS and the short informal texts found in microblogs such as Twitter. This presents a major opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large-scale data-analysis applications by extracting machine-processable information from unstructured text. This thesis discusses the challenges and opportunities that arise when applying NLP and IE to large, open-domain, heterogeneous text corpora such as Twitter and the Web, and presents solutions to a number of issues that arise in this setting. Good performance is achieved using a mostly supervised approach in cases where the set of output labels is small and well balanced. We build a set of low-level syntactic annotation tools for noisy, informal Twitter text, including a POS tagger, shallow parser, named entity segmenter and event recognizer, using supervised learning techniques trained on an annotated corpus of tweets. Supervised learning becomes impractical, however, for semantic processing tasks such as named entity categorization, event categorization, inferring selectional preferences and relation extraction, where the number of output labels is large and/or unknown a priori. A key hypothesis evaluated throughout this thesis is that semantic processing of massive, diverse text corpora such as Twitter and the Web requires unsupervised and weakly supervised methods that can leverage large unlabeled datasets for learning, rather than relying on the relatively small corpora that are feasible to annotate. We present a set of techniques for unsupervised and weakly supervised information extraction based on probabilistic latent variable models, which are applied to infer the semantics of large numbers of words and phrases and to extract knowledge from large open-domain text corpora.