Title: Statistical Learning from Shifting, Indirect, or Unseen Data: Efficient Algorithms and Theoretical Guarantees
Author: Mehta, Ronak
Advisor: Harchaoui, Zaid
Date: 2025-10-02
Description: Thesis (Ph.D.)--University of Washington, 2025
Type: Thesis
File: Mehta_washington_0250E_28783.pdf
URI: https://hdl.handle.net/1773/54130
Format: application/pdf
Language: en-US
Rights: CC BY-NC-ND
Keywords: convex optimization; distribution shift; machine learning; min-max optimization; self-supervised learning; statistical learning
Subjects: Statistics; Computer science; Applied mathematics

Abstract:

A fascinating phenomenon underlying statistical machine learning and artificial intelligence is "out-of-distribution" (or OOD) generalization. Data can (and in some settings, must) be used to draw inferences regarding probability distributions other than the one from which they were sampled. Understanding this mystery holds promise for statistical analyses that exhibit a degree of universality, such as clinical trials whose conclusions reflect many subpopulations, or pre-defined image/text encodings that can be used to solve many classification tasks simultaneously. This dissertation tackles the theoretical and algorithmic challenges of designing methods that exhibit these modern notions of generalization.

Chapter 2 studies a learning framework called distributionally robust optimization (DRO), which promotes OOD generalization by training models to optimize the worst-case expected loss achievable within a collection of possible training distributions. These maximum-type objectives present challenges for designing stochastic learning algorithms, as unbiased estimates of the gradient are not easily computed. We design a gradient estimator equipped with a progressive bias (and variance) reduction scheme, and show that the resulting algorithm enjoys a linear convergence guarantee. Although our optimization results apply to DRO problems more generally, we focus attention on a subclass of objectives called spectral risk measures, which have appealing statistical and computational properties previously unexplored in machine learning. We provide theoretical and practical guidance on selecting the various problem parameters, such as the collection of distributions over which to maximize. Finally, we present extensions of the framework, notably to group DRO, a popular variant amenable to training neural network models. (Generic formulations of these objectives are sketched below.)

Chapter 3 takes insights from the DRO application and pursues stochastic algorithms for a more general class of optimization problems, dubbed semilinear min-max problems. These objectives interpolate between the well-understood class of bilinear min-max problems and the comparatively less-understood nonbilinear ones, and include problem classes such as convex minimization with functional constraints as special cases. We present the first complexity guarantees for this problem class, using a randomized algorithm with components inspired by the simulation literature (such as adaptive sampling of new data and adaptive averaging of historical data). We prove convergence guarantees in both convex and strongly convex settings with a fine-grained dependence on individual problem constants. The results yield complexity improvements even in special cases, such as bilinearly coupled problems. We also prove a complexity lower bound for deterministic algorithms applied to the semilinear problem class. (A standard formalization of this problem class appears below.)
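As a point of reference, the Chapter 2 objective can be written in generic notation; this is a sketch consistent with the abstract, not necessarily the dissertation's exact formulation:

\[
  \min_{w \in \mathcal{W}} \; \max_{Q \in \mathcal{Q}} \; \mathbb{E}_{z \sim Q}\big[\ell(w; z)\big],
\]

where \(\mathcal{Q}\) is the collection of candidate training distributions. When \(\mathcal{Q}\) consists of reweightings of the empirical sample, one obtains the spectral risk measures mentioned above:

\[
  R_{\sigma}(w) \;=\; \sum_{i=1}^{n} \sigma_i \, \ell_{(i)}(w),
  \qquad 0 \le \sigma_1 \le \cdots \le \sigma_n, \quad \sum_{i=1}^{n} \sigma_i = 1,
\]

where \(\ell_{(i)}(w)\) denotes the \(i\)-th smallest training loss; the superquantile (CVaR), which places equal weight on the largest losses, is a standard special case.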
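To see why unbiased gradient estimates are hard to come by for such maximum-type objectives, the toy computation below (a minimal sketch assuming the superquantile spectrum, not the dissertation's estimator) shows that the naive minibatch plug-in estimate of a spectral risk is systematically biased, and the same holds for its (sub)gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size, trials = 1000, 10, 20000
losses = rng.exponential(size=n)  # toy per-example losses

def spectral_risk(v):
    """Spectral risk of a loss vector: sorted losses weighted by a
    nondecreasing spectrum (here, equal weight on the top half,
    i.e., the superquantile/CVaR at level 1/2)."""
    m = len(v)
    sigma = np.where(np.arange(m) >= m // 2, 2.0 / m, 0.0)  # sums to 1
    return np.sort(v) @ sigma

full = spectral_risk(losses)  # objective on the full sample
plug_in = np.mean([
    spectral_risk(rng.choice(losses, size=batch_size, replace=False))
    for _ in range(trials)
])
# The minibatch plug-in estimate underestimates the full objective on
# average; this is the bias that a progressive reduction scheme targets.
print(f"full: {full:.4f}   mean minibatch estimate: {plug_in:.4f}")
```

The gap does not vanish as the number of minibatches grows, only as the batch size does, which is why simply averaging more stochastic estimates cannot remove the bias.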
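Likewise, the semilinear min-max problems of Chapter 3 admit a standard formalization, again as a sketch consistent with the abstract rather than the dissertation's exact definition: the coupling is linear in the maximization variable but possibly nonlinear in the minimization variable,

\[
  \min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; f(x) + \langle y, g(x) \rangle - h(y).
\]

When \(g\) is affine this reduces to the bilinearly coupled case, and taking \(\mathcal{Y}\) to be the nonnegative orthant with \(h \equiv 0\) recovers the Lagrangian of convex minimization under the functional constraints \(g(x) \le 0\).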
Chapter 4 shifts focus from the implementation of large-scale learning algorithms to their output. We investigate predictive models that learn via a pre-training procedure with unlabeled data and can then make predictions for downstream classification tasks, without having seen any directly labeled training data from those tasks. This capability, known as zero-shot prediction, is made possible by three ingredients: 1) massive, carefully curated pre-training datasets, 2) "self-supervised" labels that allow models to learn universal features of structured data (e.g., images/text), and 3) the translation of downstream data into the format seen during pre-training using a technique called prompting. We analyze all three ingredients theoretically, establishing both the sample complexity of zero-shot prediction and the limits of prompting in terms of simple distributional conditions. Inspired by this theory, we explore variants of the pre-training objective and prompting strategies that show practical benefits, such as improved zero-shot classification accuracy.
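To make the prompting ingredient concrete, here is a minimal sketch of zero-shot classification in the style described above; `encode_text` stands in for any pre-trained text encoder mapping strings to embedding vectors, and all names are illustrative rather than taken from the dissertation:

```python
import numpy as np

def zero_shot_classify(image_embedding, class_names, encode_text):
    """Predict a label using no labeled downstream data: score each class
    by the similarity between the image and a prompted class description."""
    # Ingredient 3 (prompting): translate downstream labels into the
    # caption-like format seen during pre-training.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])

    # Normalize so that inner products are cosine similarities.
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_embedding / np.linalg.norm(image_embedding)

    # The prediction is the class whose prompt best matches the image.
    return class_names[int(np.argmax(text_emb @ image_emb))]
```

The pre-trained encoders do all the work here; the downstream task contributes only its class names, which is precisely what makes the prediction zero-shot.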