Interpretation and Validation for Unsupervised Learning

Zhang, Hanyu

dc.contributor.advisor	Meila-Predoviciu, Marina
dc.contributor.author	Zhang, Hanyu
dc.date.accessioned	2023-04-17T18:05:17Z
dc.date.available	2023-04-17T18:05:17Z
dc.date.issued	2023-04-17
dc.date.submitted	2023
dc.description	Thesis (Ph.D.)--University of Washington, 2023
dc.description.abstract	This thesis studies two major problems in unsupervised learning: manifold learning and clustering. The motivation of this research is to establish mathematically rigorous methods that give practitioners a better understanding of what an algorithm is doing, even when no ground-truth labels are available, as is typical in unsupervised learning. Specifically, we propose two criteria for a useful unsupervised learning paradigm: interpretability and stability. In the first part (chapters 2-3), we propose a framework that allows domain experts to supply a dictionary of functions that can give manifold embedding coordinates a physical meaning. We first discuss the mathematical foundation of this framework. Based on this framework, we develop two algorithms. TSLasso obtains a manifold embedding function $\hat{\phi}$ that consists directly of functions from this dictionary and is a valid parametrization of the data manifold. ManifoldLasso works with existing manifold embedding coordinates and outputs a subset of dictionary functions that parametrize those coordinates. In the second part of the thesis (chapters 4-6), we introduce the stability of clustering as a way to quantitatively validate a clustering result, so that practitioners can avoid unstable results. Our target is to establish a generic notion of $(\gamma,\epsilon)$-stability and show how it can be applied to real statistical tasks. In chapter 5, we quantify population stability with respect to K-means clustering as a quantity defined for an arbitrary population $P$.
With very mild assumptions on $P$, we show that this quantity for $P$ relates to that of a finite sample drawn from $P$: if any optimal K-means clustering of $P$ is not stable, then with high probability any global optimizer of K-means on an i.i.d. sample from $P$ is not stable; on the other hand, if the population $P$ admits a stable clustering with low K-means loss, then global optimizers of K-means clustering on an i.i.d. sample are stable with high probability. We develop an algorithm to compute an upper bound on the stability metric with respect to K-means clustering. As a byproduct, it provides an upper bound on the discrepancy between the globally optimal K-means clustering assignment and the computed one. We also provide empirical validation of this method. In chapter 6, we focus on model-based clustering through fitting mixtures of spherical Gaussians (sGMM). Fitting an sGMM is essentially a parameter estimation problem, and clustering assignments are based on the estimated parameters. This thesis mainly discusses the parametric stability of sGMMs: we show that if any two sGMMs are close, then their parameters are pairwise close. This result is proved under different assumptions on the model class of sGMMs. A numerical example also shows that, under assumptions on the separation of the components of a Gaussian mixture, we obtain a precise upper bound on the parameter distances.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	zhang_washington_0250E_25258.pdf
dc.identifier.uri	http://hdl.handle.net/1773/49960
dc.language.iso	en_US
dc.rights	none
dc.subject	clustering
dc.subject	gaussian mixture models
dc.subject	machine learning
dc.subject	manifold learning
dc.subject	stability
dc.subject	unsupervised learning
dc.subject	Statistics
dc.subject.other	Statistics
dc.title	Interpretation and Validation for Unsupervised Learning
dc.type	Thesis

Files

Original bundle

Name: zhang_washington_0250E_25258.pdf
Size: 14.63 MB
Format: Adobe Portable Document Format

Collections

Statistics