Learning Features for Text Classification
Text classification is a general and important machine learning problem. Topic classification of text documents, for example, has been studied extensively for more than a decade, and simple word features have been found to be highly indicative of topic. Researchers have focused mostly on learning classifiers rather than learning features. More recently, classification of sentiment, agreement, and opinion in social media has drawn much attention, and in these tasks individual word features are no longer sufficiently discriminative. Because good features are important to these tasks, feature engineering becomes a crucial step in developing good text classification systems. However, feature engineering involves much manual work and is time-consuming. Another challenge in many text classification tasks is limited labeled training data. Large amounts of unlabeled data are available, but they are often not used in supervised classifier training. A major feature-related consequence of limited labeled data is that only a limited set of features is seen, and classifiers trained by supervised learning cannot use features that are unseen in the training data. This thesis attempts to address both issues by applying machine learning to text features. A type of feature, the phrase pattern, is proposed together with an efficient algorithm for learning such patterns from labeled training data. Phrase pattern features are particularly useful for tasks that involve modeling long-range, complex behaviors of the kind seen in social media data, and they are more flexible than n-gram features. The learned phrase patterns can contain both words and word classes, which improves generalizability. Significant performance improvements are observed in multiple conversational text classification tasks. This thesis also proposes feature affinity and cluster regularization, which uses feature relationships learned from unlabeled data to regularize training.
This regularization scheme converts supervised text classification into semi-supervised learning, and the relative importance of the knowledge learned from unlabeled data is simple to control. With this method, features that are unseen in the labeled data receive non-zero weights through their relationships to seen features. These algorithms are evaluated on topic and sentiment classification tasks, achieving significant improvements. This thesis also studies the problem of learning feature relationships for high-dimensional features using a mixture co-occurrence model, making our approach applicable to complex text classification tasks where large amounts of unlabeled data are available. Multiple conversational text classification tasks are studied in our experiments, and significant performance improvements are demonstrated.
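The way an unseen feature can acquire a non-zero weight through its relationship to a seen feature can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes a plain logistic regression trained by gradient descent, with a hypothetical penalty of the form lam * sum over pairs (i, j) of A_ij * (w_i - w_j)^2, where the affinity values A_ij stand in for relationships that would be learned from unlabeled data. The function name `train`, the toy vocabulary, and the affinity value are all made up for illustration.

```python
import math

def train(X, y, affinity, lam=1.0, lr=0.5, steps=500):
    """Logistic regression with an assumed affinity penalty
    lam * sum over (i, j) of A_ij * (w_i - w_j)^2 on the weights."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        # Gradient of the mean logistic loss over the labeled examples.
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / len(X)
        # Gradient of the affinity penalty: pulls related weights together.
        for (i, j), a in affinity.items():
            diff = w[i] - w[j]
            grad[i] += 2.0 * lam * a * diff
            grad[j] -= 2.0 * lam * a * diff
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy features: 0 = "great", 1 = "awful", 2 = "superb".
# "superb" never occurs in the labeled data (its column is all zeros).
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
y = [1, 1, 0, 0]

# Hypothetical affinity, standing in for a relationship learned from
# unlabeled data: "great" and "superb" are related.
affinity = {(0, 2): 1.0}

w_plain = train(X, y, affinity, lam=0.0)  # purely supervised
w_reg = train(X, y, affinity, lam=0.1)    # with the affinity penalty

print("without penalty:", w_plain)
print("with penalty:   ", w_reg)
```

With `lam=0.0` the weight of the unseen feature stays at zero, because the labeled data gives it no gradient. With a positive `lam` it is pulled toward the weight of the related seen feature, and `lam` plays the role of the knob that controls how much the unlabeled-data knowledge counts.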
- Electrical engineering