Scalable Manifold Learning and Related Topics

dc.contributor.advisor: Meila, Marina
dc.contributor.author: McQueen, James
dc.date.accessioned: 2017-08-11T23:01:17Z
dc.date.available: 2017-08-11T23:01:17Z
dc.date.issued: 2017-08-11
dc.date.submitted: 2017-06
dc.description: Thesis (Ph.D.)--University of Washington, 2017-06
dc.description.abstract: The subject of manifold learning is vast and still largely unexplored. As a branch of unsupervised learning, it faces the fundamental challenge of adequately defining its problem, yet its solutions address an increasingly important need: understanding data sets intrinsically. The overarching goal of this work is to give researchers an understanding of manifold learning, a description of and proposed method for performing it, guidance for selecting parameters when applying it to large scientific data sets, and open source software powerful enough to meet the demands of big data. First, we describe manifold learning in the context of manifolds and Riemannian metrics. We use this framework to define a loss function that encodes deviation from an isometry into Euclidean space. This loss function explicitly handles the case where the embedding dimension is larger than the intrinsic dimension; by doing so, it ensures that the resulting embedding remains a submanifold of the original intrinsic dimension, a significant departure from previous methods. Due to its iterative nature, the RiemannianRelaxation algorithm naturally scales to large data sets. Second, we provide a cohesive overview of several heuristics for selecting the bandwidth and dimension parameters when performing manifold learning. We demonstrate how to scale these methods to meet the demands of big scientific data sets and apply them to a large astronomical database of galaxy spectra. Third, we combine all of these methods into an open source Python package, megaman, which is explicitly designed with the challenges of research in mind: handling large data sets, selecting parameters, repeating procedures, and storing interim steps. Together, we hope these provide scientists and researchers with the tools to apply manifold learning.
As an additional, related topic, we discuss spectral clustering. We overview the subject and how it relates to manifold learning, and propose a scalable (online) pre-processing algorithm for pruning the graph before performing spectral clustering. We then demonstrate this on a large citation network of 1.8M papers from JSTOR.
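The record does not reproduce megaman's API, so the kind of workflow the abstract describes (embedding high-dimensional data that lies on a low-dimensional manifold, with a neighborhood/bandwidth parameter to choose) is sketched below using scikit-learn's SpectralEmbedding as a stand-in; the data set, parameter values, and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

# Sample points from a 2-dimensional manifold (a swiss roll) embedded in R^3.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Embed into 2 dimensions via a graph-Laplacian (spectral) embedding.
# n_neighbors plays the role of the neighborhood/bandwidth parameter whose
# selection heuristics the abstract discusses.
emb = SpectralEmbedding(n_components=2, n_neighbors=10, random_state=0)
Y = emb.fit_transform(X)

print(Y.shape)  # one low-dimensional coordinate pair per input point
```

For data at the scale the thesis targets (millions of points), a package built for sparse neighborhood graphs, such as megaman, would replace the stand-in estimator here.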
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: McQueen_washington_0250E_17172.pdf
dc.identifier.uri: http://hdl.handle.net/1773/40305
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Clustering
dc.subject: Machine Learning
dc.subject: Manifold Learning
dc.subject: Non-Linear Dimension Reduction
dc.subject: Unsupervised Learning
dc.subject: Statistics
dc.subject.other: Statistics
dc.title: Scalable Manifold Learning and Related Topics
dc.type: Thesis

Files

Original bundle

Name: McQueen_washington_0250E_17172.pdf
Size: 2.9 MB
Format: Adobe Portable Document Format

Collections