Stochastic Gradient Descent For Modern Machine Learning: Theory, Algorithms And Applications
Tremendous advances in large-scale machine learning and deep learning have been powered by the seemingly simple and lightweight stochastic gradient method. Variants of the stochastic gradient method (based on iterate averaging) are known to be asymptotically optimal in terms of predictive performance. This thesis examines non-asymptotic issues surrounding the use of stochastic gradient descent (SGD) in practice, with the aim of achieving its asymptotically optimal statistical properties. Focusing on the stochastic approximation problem of least squares regression, this thesis considers:
1. Understanding the benefits of tail-averaged SGD, and how SGD's non-asymptotic behavior is affected by mis-specified problem instances.
2. Understanding the parallelization properties of SGD, with a specific focus on mini-batching, model averaging and batch size doubling. Can this characterization shed light on algorithmic regimes (e.g., the largest instance-dependent batch sizes) that admit linear parallelization speedups over vanilla SGD (with a batch size of 1), thus yielding useful prescriptions that make the best use of hardware resources without wasting computation? As a byproduct of these results, can we understand how the learning rate behaves as a function of the batch size?
3. Just as momentum/acceleration schemes such as heavy ball momentum or Nesterov's acceleration improve over standard batch gradient descent, can we formalize the improvements achieved by accelerated methods when working with sampled stochastic gradients? Is there an algorithm that achieves this improvement over SGD? How do deterministic accelerated schemes such as heavy ball momentum or Nesterov's acceleration behave when used with sampled stochastic gradients?
4. The behavior of the final iterate of SGD (as opposed to the majority of efforts in the stochastic approximation literature, which focus on iterate averaging) under varying stepsize schemes, including the standard polynomially decaying stepsizes and the practically preferred step decay scheme, with the aim of achieving minimax rates. The overarching goal of this part is to understand the behavior of SGD's final iterate, owing to its widespread use in practical implementations for machine learning applications.
Alongside the theoretical results, which focus on least squares regression, this thesis examines the general applicability of these results (in a qualitative sense) to the problem of training multi-layer deep neural networks on benchmark datasets, and presents several useful implications for training deep learning models of practical interest.
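To make the objects named in the abstract concrete, the following is a minimal sketch of mini-batch SGD on streaming least squares that returns both the final iterate and a tail-averaged iterate, together with a step decay schedule. It is an illustration under stated assumptions, not the thesis's algorithms or experiments: the function names (`step_decay`, `sgd_least_squares`), the Gaussian data model, the halving factor, and all constants are hypothetical choices made for this example.

```python
import random


def step_decay(base_lr, t, total_steps, num_phases=5):
    """Step decay schedule: halve the stepsize after each of num_phases
    equal-length phases. The halving factor is an illustrative assumption."""
    phase_len = max(1, total_steps // num_phases)
    return base_lr / (2 ** (t // phase_len))


def sgd_least_squares(w_star, steps, batch_size, lr_fn, tail_start,
                      noise_std, rng):
    """Mini-batch SGD on streaming least squares.

    Each sample is x ~ N(0, I), y = <w_star, x> + Gaussian noise
    (an assumed data model). Returns (final_iterate, tail_average), where
    the tail average is the mean of the iterates produced at steps
    t >= tail_start.
    """
    d = len(w_star)
    w = [0.0] * d
    tail_sum = [0.0] * d
    tail_count = 0
    for t in range(steps):
        # Average the per-sample gradients (x.w - y) * x over a fresh batch.
        grad = [0.0] * d
        for _ in range(batch_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            y = sum(a * b for a, b in zip(x, w_star)) + rng.gauss(0.0, noise_std)
            resid = sum(a * b for a, b in zip(x, w)) - y
            for j in range(d):
                grad[j] += resid * x[j] / batch_size
        lr = lr_fn(t)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        if t >= tail_start:
            tail_sum = [s + wj for s, wj in zip(tail_sum, w)]
            tail_count += 1
    w_tail = [s / tail_count for s in tail_sum] if tail_count else list(w)
    return w, w_tail


def squared_error(w, w_star):
    """Squared Euclidean distance between an iterate and the true weights."""
    return sum((a - b) ** 2 for a, b in zip(w, w_star))
```

A possible way to use it: draw `w_star` from a seeded `random.Random`, run `sgd_least_squares` once with a constant stepsize such as `lambda t: 0.05` and once with `lambda t: step_decay(0.05, t, steps)`, and compare `squared_error` for the final iterate and the tail average. Changing `batch_size` while adjusting the stepsize gives a rough feel for the batch size and learning rate trade-off the abstract asks about, though this toy setup is not a substitute for the analysis in the thesis.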
- Electrical engineering