Learning Features for Text Classification

dc.contributor.advisor: Ostendorf, Mari
dc.contributor.author: Zhang, Bin
dc.date.accessioned: 2013-07-25T16:26:00Z
dc.date.available: 2013-07-25T16:26:00Z
dc.date.issued: 2013-07-25
dc.date.submitted: 2013
dc.description: Thesis (Ph.D.)--University of Washington, 2013
dc.description.abstract: Text classification is a general and important machine learning problem. Topic classification of text documents, for example, has been studied extensively for more than a decade, and simple word features have been found to be highly indicative of topic. Research has focused mostly on machine learning of classifiers rather than of features. More recently, classification of sentiment, agreement, and opinion in social media has drawn much attention; for these tasks, individual word features are no longer sufficiently discriminative. Because good features are important to these tasks, feature engineering becomes a crucial step in developing good text classification systems, but it involves much manual work and is time-consuming. Another challenge in many text classification tasks is limited labeled training data. Large amounts of unlabeled data are available, but they often go unused in supervised classifier training. One feature-related consequence of limited labeled data is that only a limited set of features is observed, and classifiers trained with supervised learning cannot use features that are unseen in the training data. This thesis addresses both issues by applying machine learning to text features. It proposes a new type of feature, phrase patterns, together with an efficient algorithm for learning them from labeled training data. Phrase pattern features are particularly useful for tasks that involve modeling long-range, complex behaviors, as in social media data, and they are more flexible than n-gram features. The learned phrase patterns can contain both words and word classes, which improves generalizability. Significant performance improvements are observed in multiple conversational text classification tasks.
This thesis also proposes feature affinity and cluster regularization, which use feature relationships learned from unlabeled data to regularize classifier training. This regularization scheme converts supervised text classification into semi-supervised learning, and it makes it simple to control the relative importance of the knowledge learned from unlabeled data. With this method, features that are unseen in the labeled data receive non-zero weights through their relationships to seen features. These algorithms are evaluated on topic and sentiment classification tasks, achieving significant improvements. The thesis further studies the problem of learning feature relationships for high-dimensional features using a mixture co-occurrence model, making the approach applicable to complex classification tasks where large amounts of unlabeled data are available. Multiple conversational text classification tasks are studied in the experiments, and significant performance improvements are demonstrated.
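The abstract's phrase-pattern idea (feature sequences mixing literal words and word classes, matched more flexibly than contiguous n-grams) can be sketched as follows. This is an illustrative assumption, not the thesis's actual formulation: the `C:` prefix for word-class items and the gapped-subsequence matching rule are invented here for the example.

```python
# Minimal sketch of matching a phrase pattern against a tagged sentence.
# A pattern is a sequence of items, each either a literal word or a word
# class (written here with an assumed "C:" prefix); items may match
# non-contiguously, which is what makes phrase patterns more flexible
# than n-gram features for long-range behaviors.

def matches(pattern, tagged_sentence):
    """Return True if `pattern` occurs as a (possibly gapped) subsequence
    of `tagged_sentence`, a list of (word, word_class) pairs."""
    i = 0  # position in the pattern
    for word, word_class in tagged_sentence:
        if i == len(pattern):
            break
        item = pattern[i]
        if item.startswith("C:"):          # word-class item
            if item[2:] == word_class:
                i += 1
        elif item == word:                 # literal-word item
            i += 1
    return i == len(pattern)

sentence = [("i", "PRON"), ("totally", "ADV"), ("disagree", "VERB"),
            ("with", "ADP"), ("you", "PRON")]
print(matches(["i", "C:ADV", "disagree"], sentence))   # True (contiguous match)
print(matches(["i", "disagree", "C:PRON"], sentence))  # True (skips "totally", "with")
```

Mixing word classes into patterns lets one learned feature cover many surface word sequences, which is the generalizability gain the abstract describes.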
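The affinity-regularization mechanism, by which features unseen in labeled data receive non-zero weights through their relationships to seen features, can be sketched as a graph (Laplacian) penalty added to a logistic-regression loss. The toy data, the hand-made affinity matrix, and the penalty strength below are all illustrative assumptions; in the thesis the feature relationships are learned from unlabeled data.

```python
# Toy sketch of feature-affinity regularization: the loss adds a graph
# penalty sum_ij A[i,j] * (w[i] - w[j])^2 built from feature relationships.
# Feature 2 never fires in the training data, but its affinity to feature 0
# pulls its weight toward a useful non-zero value.
import numpy as np

X = np.array([[1.0, 0.0, 0.0],   # feature 2 is unseen in training
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

A = np.zeros((3, 3))
A[0, 2] = A[2, 0] = 1.0          # assume features 0 and 2 are "related"

L = np.diag(A.sum(axis=1)) - A   # graph Laplacian of the affinity matrix
lam = 0.5                        # controls the weight of unlabeled-data knowledge
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))       # logistic predictions
    grad_loss = X.T @ (p - y) / len(y)     # log-loss gradient
    # gradient of sum_ij A[i,j] (w_i - w_j)^2 is 4 * L @ w for symmetric A
    grad_reg = 4.0 * lam * (L @ w)
    w -= 0.1 * (grad_loss + grad_reg)

print(w)  # w[2] != 0 even though feature 2 never appeared in training
```

The single coefficient `lam` is what makes the relative importance of the unlabeled-data knowledge simple to control, as the abstract notes: `lam = 0` recovers plain supervised training.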
dc.embargo.terms: No embargo
dc.format.mimetype: application/pdf
dc.identifier.other: Zhang_washington_0250E_11427.pdf
dc.identifier.uri: http://hdl.handle.net/1773/23330
dc.language.iso: en_US
dc.rights: Copyright is held by the individual authors.
dc.subject: feature learning; high-dimensional features; natural language processing; phrase patterns; semi-supervised learning; text classification
dc.subject.other: Electrical engineering
dc.subject.other: Computer engineering
dc.subject.other: Computer science
dc.subject.other: electrical engineering
dc.title: Learning Features for Text Classification
dc.type: Thesis

Files

Original bundle

Name: Zhang_washington_0250E_11427.pdf
Size: 702.29 KB
Format: Adobe Portable Document Format