Extracting Knowledge from Twitter and The Web

Ritter, Alan L.

dc.contributor.advisor	Etzioni, Oren	en_US
dc.contributor.author	Ritter, Alan L.	en_US
dc.date.accessioned	2013-11-14T20:53:04Z
dc.date.available	2013-11-14T20:53:04Z
dc.date.issued	2013-11-14
dc.date.submitted	2013	en_US
dc.description	Thesis (Ph.D.)--University of Washington, 2013	en_US
dc.description.abstract	The internet has revolutionized the way we communicate, leading to a constant flood of text in electronic format, including the Web, email, SMS and the short informal texts found in microblogs such as Twitter. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large-scale data-analysis applications by extracting machine-processable information from unstructured text at scale. This thesis discusses the challenges and opportunities which arise when applying NLP and IE to large open-domain and heterogeneous text corpora such as Twitter and the Web, and presents solutions to a number of issues which arise in this setting. Good performance is achieved using a mostly supervised approach in cases where the number of output labels is small and well-balanced. We build a set of low-level syntactic annotation tools for noisy informal Twitter text, including a POS tagger, shallow parser, named entity segmenter and event recognizer, using supervised learning techniques trained on an annotated corpus of tweets. Supervised learning becomes impractical, however, for semantic processing tasks such as named entity categorization, event categorization, inferring selectional preferences and relation extraction, where the number of output labels is large and/or unknown a priori. A key hypothesis which is evaluated throughout this thesis is that semantic processing of massive, diverse text corpora such as Twitter and the Web requires unsupervised and weakly supervised methods which can leverage large unlabeled datasets for learning, rather than relying on the relatively small corpora which are feasible to annotate. We present a set of techniques for unsupervised and weakly supervised information extraction based on probabilistic latent variable models, which are applied to infer the semantics of large numbers of words and phrases and extract knowledge from large open-domain text corpora.	en_US
dc.embargo.terms	No embargo	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.other	Ritter_washington_0250E_11675.pdf	en_US
dc.identifier.uri	http://hdl.handle.net/1773/24134
dc.language.iso	en_US	en_US
dc.rights	Copyright is held by the individual authors.	en_US
dc.subject	information extraction; latent variables; lexical semantics; machine learning; natural language processing; social computing	en_US
dc.subject.other	Computer science	en_US
dc.subject.other	Linguistics	en_US
dc.subject.other	computer science and engineering	en_US
dc.title	Extracting Knowledge from Twitter and The Web	en_US
dc.type	Thesis	en_US

Files

Original bundle

Name: Ritter_washington_0250E_11675.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering