Extracting Knowledge from Twitter and The Web

dc.contributor.advisor: Etzioni, Oren (en_US)
dc.contributor.author: Ritter, Alan L. (en_US)
dc.date.accessioned: 2013-11-14T20:53:04Z
dc.date.available: 2013-11-14T20:53:04Z
dc.date.issued: 2013-11-14
dc.date.submitted: 2013 (en_US)
dc.description: Thesis (Ph.D.)--University of Washington, 2013 (en_US)
dc.description.abstract: The internet has revolutionized the way we communicate, leading to a constant flood of text in electronic format, including the Web, email, SMS and the short informal texts found in microblogs such as Twitter. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large-scale data-analysis applications by extracting machine-processable information from unstructured text at scale. This thesis discusses the challenges and opportunities that arise when applying NLP and IE to large, open-domain, heterogeneous text corpora such as Twitter and the Web, and presents solutions to a number of issues that arise in this setting. Good performance is achieved using a mostly supervised approach in cases where the number of output labels is small and well-balanced. We build a set of low-level syntactic annotation tools for noisy, informal Twitter text, including a POS tagger, shallow parser, named entity segmenter and event recognizer, using supervised learning techniques trained on an annotated corpus of tweets. Supervised learning becomes impractical, however, for semantic processing tasks such as named entity categorization, event categorization, inferring selectional preferences and relation extraction, where the number of output labels is large and/or unknown a priori. A key hypothesis evaluated throughout this thesis is that semantic processing of massive, diverse text corpora such as Twitter and the Web requires unsupervised and weakly supervised methods that can leverage large unlabeled datasets for learning, rather than relying on the relatively small corpora that are feasible to annotate. We present a set of techniques for unsupervised and weakly supervised information extraction based on probabilistic latent variable models, which are applied to infer the semantics of large numbers of words and phrases and to extract knowledge from large open-domain text corpora. (en_US)
dc.embargo.terms: No embargo (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.other: Ritter_washington_0250E_11675.pdf (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/24134
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.subject: information extraction; latent variables; lexical semantics; machine learning; natural language processing; social computing (en_US)
dc.subject.other: Computer science (en_US)
dc.subject.other: Linguistics (en_US)
dc.subject.other: computer science and engineering (en_US)
dc.title: Extracting Knowledge from Twitter and The Web (en_US)
dc.type: Thesis (en_US)

Files

Original bundle

Name: Ritter_washington_0250E_11675.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format