Learning Features for Text Classification

dc.contributor.advisor: Ostendorf, Mari
dc.contributor.author: Zhang, Bin
dc.date.accessioned: 2013-07-25T16:26:00Z
dc.date.available: 2013-07-25T16:26:00Z
dc.date.issued: 2013-07-25
dc.date.submitted: 2013
dc.description: Thesis (Ph.D.)--University of Washington, 2013
dc.description.abstract: Text classification is a general and important machine learning problem. Topic classification of text documents, for example, has been studied extensively for more than a decade, and simple word features have been found to be highly indicative of topic. Research has focused mostly on machine learning of classifiers rather than of features. More recently, classification of sentiment, agreement, and opinion in social media has drawn much attention; for these tasks, individual word features are no longer sufficiently discriminative. Because good features are important to these tasks, feature engineering becomes a crucial step in developing good text classification systems, but it involves much manual work and is time-consuming. Another challenge in many text classification tasks is limited labeled training data. Large amounts of unlabeled data are available, but they often go unused in supervised classifier training. One feature-related consequence of limited labeled data is that only a limited set of features is observed, and classifiers trained with supervised learning cannot use features that are unseen in the training data. This thesis addresses both issues by applying machine learning to text features. It proposes a new type of feature, phrase patterns, together with an efficient algorithm for learning them from labeled training data. Phrase pattern features are particularly useful for tasks that involve modeling long-range, complex behaviors, as in social media data, and they are more flexible than n-gram features. The learned phrase patterns can contain both words and word classes, which improves generalizability. Significant performance improvements are observed in multiple conversational text classification tasks.
This thesis also proposes feature affinity and cluster regularization, which use feature relationships learned from unlabeled data to regularize classifier training. This regularization scheme converts supervised text classification into semi-supervised learning, and it makes it simple to control the relative importance of the knowledge learned from unlabeled data. With this method, features that are unseen in the labeled data receive non-zero weights through their relationships to seen features. These algorithms are evaluated on topic and sentiment classification tasks, achieving significant improvements. The thesis further studies the problem of learning feature relationships for high-dimensional features using a mixture co-occurrence model, making the approach applicable to complex classification tasks where large amounts of unlabeled data are available. Multiple conversational text classification tasks are studied in the experiments, and significant performance improvements are demonstrated.
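The abstract's phrase-pattern idea (feature sequences mixing literal words and word classes, matched more flexibly than contiguous n-grams) can be sketched as follows. This is an illustrative assumption, not the thesis's actual formulation: the `C:` prefix for word-class items and the gapped-subsequence matching rule are invented here for the example.

```python
# Minimal sketch of matching a phrase pattern against a tagged sentence.
# A pattern is a sequence of items, each either a literal word or a word
# class (written here with an assumed "C:" prefix); items may match
# non-contiguously, which is what makes phrase patterns more flexible
# than n-gram features for long-range behaviors.

def matches(pattern, tagged_sentence):
    """Return True if `pattern` occurs as a (possibly gapped) subsequence
    of `tagged_sentence`, a list of (word, word_class) pairs."""
    i = 0  # position in the pattern
    for word, word_class in tagged_sentence:
        if i == len(pattern):
            break
        item = pattern[i]
        if item.startswith("C:"):          # word-class item
            if item[2:] == word_class:
                i += 1
        elif item == word:                 # literal-word item
            i += 1
    return i == len(pattern)

sentence = [("i", "PRON"), ("totally", "ADV"), ("disagree", "VERB"),
            ("with", "ADP"), ("you", "PRON")]
print(matches(["i", "C:ADV", "disagree"], sentence))   # True (contiguous match)
print(matches(["i", "disagree", "C:PRON"], sentence))  # True (skips "totally", "with")
```

Mixing word classes into patterns lets one learned feature cover many surface word sequences, which is the generalizability gain the abstract describes.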
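The affinity-regularization mechanism, by which features unseen in labeled data receive non-zero weights through their relationships to seen features, can be sketched as a graph (Laplacian) penalty added to a logistic-regression loss. The toy data, the hand-made affinity matrix, and the penalty strength below are all illustrative assumptions; in the thesis the feature relationships are learned from unlabeled data.

```python
# Toy sketch of feature-affinity regularization: the loss adds a graph
# penalty sum_ij A[i,j] * (w[i] - w[j])^2 built from feature relationships.
# Feature 2 never fires in the training data, but its affinity to feature 0
# pulls its weight toward a useful non-zero value.
import numpy as np

X = np.array([[1.0, 0.0, 0.0],   # feature 2 is unseen in training
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

A = np.zeros((3, 3))
A[0, 2] = A[2, 0] = 1.0          # assume features 0 and 2 are "related"

L = np.diag(A.sum(axis=1)) - A   # graph Laplacian of the affinity matrix
lam = 0.5                        # controls the weight of unlabeled-data knowledge
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))       # logistic predictions
    grad_loss = X.T @ (p - y) / len(y)     # log-loss gradient
    # gradient of sum_ij A[i,j] (w_i - w_j)^2 is 4 * L @ w for symmetric A
    grad_reg = 4.0 * lam * (L @ w)
    w -= 0.1 * (grad_loss + grad_reg)

print(w)  # w[2] != 0 even though feature 2 never appeared in training
```

The single coefficient `lam` is what makes the relative importance of the unlabeled-data knowledge simple to control, as the abstract notes: `lam = 0` recovers plain supervised training.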
dc.embargo.terms: No embargo
dc.format.mimetype: application/pdf
dc.identifier.other: Zhang_washington_0250E_11427.pdf
dc.identifier.uri: http://hdl.handle.net/1773/23330
dc.language.iso: en_US
dc.rights: Copyright is held by the individual authors.
dc.subject: feature learning; high-dimensional features; natural language processing; phrase patterns; semi-supervised learning; text classification
dc.subject.other: Electrical engineering
dc.subject.other: Computer engineering
dc.subject.other: Computer science
dc.subject.other: electrical engineering
dc.title: Learning Features for Text Classification
dc.type: Thesis

Files

Original bundle

Name: Zhang_washington_0250E_11427.pdf
Size: 702.29 KB
Format: Adobe Portable Document Format