Scalable Manifold Learning and Related Topics

dc.contributor.advisor: Meila, Marina
dc.contributor.author: McQueen, James
dc.date.accessioned: 2017-08-11T23:01:17Z
dc.date.available: 2017-08-11T23:01:17Z
dc.date.issued: 2017-08-11
dc.date.submitted: 2017-06
dc.description: Thesis (Ph.D.)--University of Washington, 2017-06
dc.description.abstract: The subject of manifold learning is vast and still largely unexplored. As a branch of unsupervised learning, it faces the fundamental challenge of adequately defining its problem, yet its solutions address an increasingly important need: understanding data sets intrinsically. The overarching goal of this work is to give researchers an understanding of manifold learning, a description of and proposed method for performing it, guidance for selecting parameters when applying it to large scientific data sets, and open source software powerful enough to meet the demands of big data. First, we describe manifold learning in the context of manifolds and Riemannian metrics. We use this framework to define a loss function that encodes deviation from an isometry into Euclidean space. This loss function explicitly handles the case where the embedding dimension is larger than the intrinsic dimension; by doing so, it ensures that the resulting embedding remains a submanifold of the original intrinsic dimension, a significant departure from previous methods. Due to its iterative nature, the RiemannianRelaxation algorithm naturally scales to large data sets. Second, we provide a cohesive overview of several heuristics for selecting the bandwidth and dimension parameters when performing manifold learning. We demonstrate how to scale these methods to meet the demands of big scientific data sets and apply them to a large astronomical database of galaxy spectra. Third, we combine all of these methods into an open source Python package, megaman, which is explicitly designed with the challenges of research in mind: handling large data sets, selecting parameters, repeating procedures, and storing interim steps. Together, we hope these provide scientists and researchers with the tools to apply manifold learning.
As an additional, related topic, we discuss spectral clustering. We overview the subject and how it relates to manifold learning, and propose a scalable (online) pre-processing algorithm for pruning the graph before performing spectral clustering. We then demonstrate this on a large citation network of 1.8M papers from JSTOR.
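The record does not reproduce megaman's API, so the kind of workflow the abstract describes (embedding high-dimensional data that lies on a low-dimensional manifold, with a neighborhood/bandwidth parameter to choose) is sketched below using scikit-learn's SpectralEmbedding as a stand-in; the data set, parameter values, and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

# Sample points from a 2-dimensional manifold (a swiss roll) embedded in R^3.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Embed into 2 dimensions via a graph-Laplacian (spectral) embedding.
# n_neighbors plays the role of the neighborhood/bandwidth parameter whose
# selection heuristics the abstract discusses.
emb = SpectralEmbedding(n_components=2, n_neighbors=10, random_state=0)
Y = emb.fit_transform(X)

print(Y.shape)  # one low-dimensional coordinate pair per input point
```

For data at the scale the thesis targets (millions of points), a package built for sparse neighborhood graphs, such as megaman, would replace the stand-in estimator here.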
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: McQueen_washington_0250E_17172.pdf
dc.identifier.uri: http://hdl.handle.net/1773/40305
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Clustering
dc.subject: Machine Learning
dc.subject: Manifold Learning
dc.subject: Non-Linear Dimension Reduction
dc.subject: Unsupervised Learning
dc.subject: Statistics
dc.subject.other: Statistics
dc.title: Scalable Manifold Learning and Related Topics
dc.type: Thesis

Files

Original bundle

Name: McQueen_washington_0250E_17172.pdf
Size: 2.9 MB
Format: Adobe Portable Document Format

Collections