Interpretation and Validation for Unsupervised Learning

dc.contributor.advisorMeila-Predoviciu, Marina
dc.contributor.authorZhang, Hanyu
dc.date.accessioned2023-04-17T18:05:17Z
dc.date.available2023-04-17T18:05:17Z
dc.date.issued2023-04-17
dc.date.submitted2023
dc.descriptionThesis (Ph.D.)--University of Washington, 2023
dc.description.abstractThis thesis studies two major problems in unsupervised learning: manifold learning and clustering. The motivation of this research is to establish mathematically rigorous methods that give practitioners a better understanding of what an algorithm is doing, even though unsupervised learning problems have no ground truth labels. Specifically, we propose two criteria for a useful unsupervised learning paradigm: interpretability and stability. In the first part (chapters 2-3), we propose a framework that allows domain experts to supply a set of dictionary functions that can provide manifold embedding coordinates with physical meaning. We first discuss the mathematical foundation of this framework. Based on this framework, we develop two algorithms. TSLasso obtains a manifold embedding function $\hat{\phi}$ that consists directly of functions from this dictionary and is a valid parametrization of the data manifold. ManifoldLasso works with existing manifold embedding coordinates and outputs a subset of functions that parametrize those coordinates. In the second part of the thesis (chapters 4-6), we introduce the stability of clustering as a way to quantitatively validate a clustering result, so that practitioners can avoid unwanted phenomena. Our goal is to establish a generic notion of $(\gamma,\epsilon)$-stability and show how it can be applied to real statistical tasks. In chapter 5, we quantify population stability with respect to K-means clustering as a quantity for an arbitrary population $P$.
Under very mild assumptions on $P$, we show that this quantity for $P$ relates to that of a finite sample drawn from $P$: if any optimal K-means clustering of $P$ is not stable, then with high probability any global optimizer of K-means on an i.i.d. sample from $P$ is not stable; on the other hand, if the population $P$ admits a stable clustering with low K-means loss, then the global optimizers of K-means clustering on an i.i.d. sample are stable with high probability. We develop an algorithm to compute an upper bound on the stability metric with respect to K-means clustering. As a byproduct, it provides an upper bound on the discrepancy between the globally optimal K-means clustering assignment and the computed one. We also provide empirical validation of this method. In chapter 6, we focus on model-based clustering through fitting mixtures of spherical Gaussians (sGMM). Fitting an sGMM is essentially a parameter estimation problem, and clustering assignments are based on the estimates. This thesis mainly discusses the parametric stability of sGMMs: we show that if any two sGMMs are close, then their parameters are pairwise close. This result is proved under different assumptions on the model class of sGMMs. We also see from a numerical example that, under assumptions on the separation of the components in a Gaussian mixture, we obtain a precise upper bound on the parameter distances.
dc.embargo.termsOpen Access
dc.format.mimetypeapplication/pdf
dc.identifier.otherzhang_washington_0250E_25258.pdf
dc.identifier.urihttp://hdl.handle.net/1773/49960
dc.language.isoen_US
dc.rightsnone
dc.subjectclustering
dc.subjectgaussian mixture models
dc.subjectmachine learning
dc.subjectmanifold learning
dc.subjectstability
dc.subjectunsupervised learning
dc.subjectStatistics
dc.subject.otherStatistics
dc.titleInterpretation and Validation for Unsupervised Learning
dc.typeThesis

Files

Original bundle

Name:
zhang_washington_0250E_25258.pdf
Size:
14.63 MB
Format:
Adobe Portable Document Format
