Feature Extraction Using Topological Data Analysis for Machine Learning and Network Science Applications
Many real-world data sets can be viewed as a noisy sampling of an unknown high-dimensional topological space. The emergence and development of topological data analysis (TDA) over roughly the last fifteen years has provided a suite of tools to understand and exploit the topological structure of the underlying space from a multi-scale perspective that characterizes the shape of the data. This dissertation therefore aims to leverage the shape information that TDA tools provide to extract key features in machine learning and network science problems. Following this line of research, we investigate three understudied TDA topics. We first extend the application of TDA to the manufacturing systems domain. We apply a widely used TDA method, known as the Mapper algorithm, to two benchmark data sets for chemical process yield prediction and semiconductor wafer fault detection. The algorithm yields topological networks that capture the intrinsic clusters (i.e., subgroups) present in the data sets and the connections among them, which are difficult to detect using traditional methods. Key process variables (features) that best differentiate the subgroups of interest are subsequently identified through statistical tests. Next, we present a new method, referred to as the Sparse-TDA method, that integrates a QR pivoting-based sparse sampling algorithm into a vector-based TDA method to transform topological features into image pixels and identify discriminative pixel samples (features) in the presence of noisy and redundant information. We demonstrate its advantage over a state-of-the-art kernel TDA method and L1-regularized feature selection methods in terms of classification accuracy and training time on three challenging data sets pertaining to 3D meshes of synthetic and real human postures and textured images. Finally, we propose a method that extends persistence-based TDA, which is typically used for characterizing shapes, to general networks.
We introduce the concept of the community tree, a tree structure built from the clique communities produced by the clique percolation method, to summarize the topological structures in a network from a persistence perspective. Furthermore, we develop efficient algorithms to construct and update community trees by maintaining a series of clique graphs in the form of spanning forests, in which each spanning tree is built on an underlying Euler tour tree. Using the information revealed by community trees and the corresponding persistence diagrams, the proposed approach can detect clique communities and, given a stability threshold, keep track of the major structural changes during their evolution. The results demonstrate its effectiveness in extracting useful structural insights for time-varying social networks.
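To make the notion of clique communities concrete, the following is a minimal sketch of the standard clique percolation method, in which two k-cliques belong to the same community when they can be chained through overlaps of k-1 nodes. It is a naive illustration only, not the dissertation's algorithm: it does not use spanning forests or Euler tour trees. The function names (`maximal_cliques`, `clique_communities`) and the toy graph `EDGES` are hypothetical, and the sketch assumes an undirected graph with sortable node labels and no self-loops.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_communities(edges, k):
    """Return k-clique communities as sorted node lists (naive version)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Every k-clique is a k-subset of some maximal clique.
    k_cliques = set()
    for clique in maximal_cliques(adj):
        if len(clique) >= k:
            for sub in combinations(sorted(clique), k):
                k_cliques.add(frozenset(sub))
    k_cliques = list(k_cliques)

    # Union-find over k-cliques: merge two cliques that share k-1 nodes.
    parent = list(range(len(k_cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(k_cliques)), 2):
        if len(k_cliques[i] & k_cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(k_cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Toy graph (hypothetical): two triangles sharing an edge, joined by a
# single bridge edge (4, 5) to a third triangle.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]

if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, clique_communities(EDGES, k))
```

On this toy graph, the two communities found at k = 3 are nested inside the single community found at k = 2, and no community survives at k = 4. This nesting across values of k is the kind of hierarchy a tree summary can record, although the exact construction of the community tree in the dissertation may differ from this reading.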