dc.description.abstract | High-dimensional data sets often have lower-dimensional structure taking the form of a submanifold of a Euclidean space. It is challenging but necessary to develop statistical methods for these data sets that respect the manifold structure. We present research from two areas: manifold learning (i.e., support estimation) and smooth regression on manifolds. First, we consider the problem of recovering a $d$-dimensional submanifold $\M$ of $\R^n$ when provided with $N$ samples from $\M$. Ideally, the estimator of $\M$ should be a manifold of a certain smoothness, and it should converge to $\M$ in Hausdorff distance as $N$ increases. Fefferman, Mitter, and Narayanan (2016) have developed an algorithm whose output is provably a manifold. The algorithm relies on the definition of an approximate squared-distance function (asdf) to $\M$. As long as the asdf meets certain regularity conditions (which can be difficult to verify), it can be used to define an estimator of $\M$ that has the desired properties. We define two asdfs that can be calculated solely from the sample and show that they meet the required regularity conditions. The first asdf is based on kernel density estimation, and the second is based on the estimation of tangent spaces with local principal component analysis (local PCA). Second, we analyze a structural risk minimization-based algorithm for the regression of real-valued $C^{1, 1}$ functions defined on a manifold. (These are differentiable functions whose derivative is Lipschitz.) We assume that we are provided with $N$ sample points from $\M$, a $d$-dimensional $C^2$ submanifold of $\R^n$, as well as noisy observations from a $C^{1, 1}(\M)$ function $f^*$. We first present results of independent interest on sampling a $C^2$ atlas of $\M$ with high probability (w.h.p.).
To do this, we 1) show that the sample contains, w.h.p., a fine-enough net of $\M$ of bounded size, 2) derive uniform convergence rates for the empirical measure indexed by particular subsets of $\M$, and 3) prove that tangent spaces estimated with local PCA are close in angular distance to the true tangent spaces w.h.p. The estimated tangent spaces can be used to define charts of $\M$ whose derivatives have a uniformly bounded Lipschitz constant. After sampling an atlas, we use $C^{1, 1}(\R^d)$ regression to locally estimate the pullbacks of $f^*$ to the estimated tangent spaces and then use a partition of unity to define the global estimator of $f^*$. The $C^{1, 1}(\R^d)$ regression algorithm was developed by our collaborators (Gustafson, Hirn, Mohammed, Narayanan, and Xu, 2018), and it is closely related to a $C^{1, 1}(\R^d)$ interpolation algorithm of Herbert-Voss, Hirn, and McCollum (2017). We use tools from the theory of empirical processes to analyze the $C^{1, 1}(\R^d)$ regression algorithm, and we combine these results with properties of the $C^2$ atlas to derive risk bounds and convergence rates for our $C^{1, 1}(\M)$ regression algorithm. | |
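The tangent-space estimation step described in the abstract can be illustrated with a minimal sketch: estimate the tangent space at a point by centering its neighbors within a ball and taking the top-$d$ principal directions. This is a generic local PCA sketch, not the thesis's algorithm; the function name and the choices of radius and sample size are illustrative assumptions.

```python
import numpy as np

def local_pca_tangent(X, x0, radius, d):
    """Estimate a d-dimensional tangent space at x0 from samples X (N x n)
    via local PCA: gather the samples within `radius` of x0, center them,
    and take the top-d right singular vectors of the centered neighborhood."""
    nbrs = X[np.linalg.norm(X - x0, axis=1) < radius]
    centered = nbrs - nbrs.mean(axis=0)
    # Rows of Vt are principal directions; the first d rows give an
    # orthonormal basis (d x n) for the estimated tangent space.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:d]

# Example: points on the unit circle, a d = 1 submanifold of R^2.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
x0 = np.array([1.0, 0.0])
T = local_pca_tangent(X, x0, radius=0.3, d=1)
# At (1, 0) the true tangent direction is (0, 1), up to sign.
```

In this noiseless example the estimated direction aligns with the true tangent up to sign; the abstract's results quantify how close such estimates are in angular distance, with high probability, under sampling assumptions.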