Scalable Manifold Learning and Related Topics
Author
McQueen, James
Abstract
The subject of manifold learning is vast and still largely unexplored. As a branch of unsupervised learning, it faces a fundamental challenge in adequately defining the problem, yet its solution answers an increasingly important desire to understand data sets intrinsically. The overarching goal of this work is to give researchers an understanding of manifold learning, a description of a proposed method for performing it, guidance for selecting parameters when applying it to large scientific data sets, and open source software powerful enough to meet the demands of big data. First, we describe manifold learning in the context of manifolds and Riemannian metrics. We use this framework to define a loss function that encodes deviation from an isometry into Euclidean space. The loss function explicitly handles the case where the embedding dimension is larger than the intrinsic dimension, which ensures that the resulting embedding is still a submanifold of the original intrinsic dimension; this is a significant departure from previous methods. Because the algorithm, RiemannianRelaxation, is iterative, it scales naturally to large data sets. Second, we provide a cohesive overview of several heuristics for selecting the bandwidth and dimension parameters when performing manifold learning. We demonstrate how to scale these methods to meet the demands of big scientific data sets and apply them to a large astronomical database of galaxy spectra. Third, we combine all of these methods into an open source Python package, megaman, which is explicitly designed with the challenges of research in mind: dealing with large data sets, selecting parameters, repeating procedures, and storing interim steps. Together, we hope these contributions provide scientists and researchers alike with the tools to apply manifold learning.
As an additional, related topic, we discuss spectral clustering. We give an overview of the subject and its relation to manifold learning, and we propose a scalable (online) pre-processing algorithm for pruning the graph before performing spectral clustering. We then demonstrate this algorithm on a large 1.8M-paper citation network from JSTOR.
Collections
- Statistics