Perspectives on Policy Learning

Authors

Ainsworth, Samuel Kenneth

Abstract

Sequential decision making, especially in the face of uncertainty, is a central challenge in our quest to build increasingly safe, capable, and (seemingly) intelligent autonomous systems. Whereas supervised learning is concerned with selecting optimal actions in independent, one-off interactions with an environment, policy learning studies action selection across sequential, dependent interactions. Policy learning constitutes an elemental component of many techniques across reinforcement learning, optimal control, and robotics. In this dissertation we present a variety of perspectives on policy learning, each viewing the problem through a slightly different lens. We analyze theoretical and practical speedups to policy learning and explore subtle yet critical ways in which it differs from supervised learning. We proceed in three parts:

1. In Chapter 2 we study which interactions with the environment are most beneficial, sparked by the intuition that many environments include "dead end" regions of state space. We extend this framework to additionally study the tradeoffs associated with safety interventions in real-world deployments of policy learning techniques. We analyze the regret behavior of emergency stopping and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods. (A toy early-reset loop is sketched after this list.)

2. In Chapter 3 we study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, with the explicit goal of estimating continuous-time gradients, we can discretize adaptively and construct a more efficient and accurate estimator, which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing conventional policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators. (A minimal adjoint-method sketch follows the list.)

3. Intrigued by the failures of policy gradient methods in certain settings, we study how policy learning differs from supervised learning in Chapter 4. In particular, we explore the unreasonable effectiveness of stochastic gradient descent in supervised learning loss landscapes. We hypothesize that this effectiveness arises because such loss landscapes effectively contain only a single basin, modulo symmetries in the model parameterization. We propose three novel canonicalization algorithms to reconcile the symmetries between model weights, and we justify the single-basin claim with experiments across a variety of model architectures and datasets, including, to the best of our knowledge, the first demonstration of perfect linear mode connectivity between two completely independently trained large ResNet models. (A small weight-matching sketch follows the list.)
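
To make the early-reset idea in Chapter 2 concrete, here is a minimal sketch of a rollout loop that cuts trajectories short, assuming a Gymnasium-style environment API. The `is_dead_end` predicate is purely hypothetical, standing in for whatever learned or hand-specified detector is available; this is not the dissertation's mechanism.

```python
# Minimal sketch: early resets in a Gymnasium-style rollout loop. The
# `is_dead_end` predicate is a hypothetical stand-in for a learned or
# hand-specified dead-end detector.
def rollout(env, policy, is_dead_end, max_steps=1000):
    obs, _ = env.reset()
    transitions = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        # Reset early rather than spend sample budget in an unrecoverable
        # region of state space.
        if terminated or truncated or is_dead_end(next_obs):
            break
        obs = next_obs
    return transitions
```

The intended effect is purely one of sample efficiency: steps that would otherwise be spent wandering an unrecoverable region are reallocated to fresh episodes.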
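The heart of Chapter 3's approach, estimating a gradient of the continuous-time objective and letting the solver pick the discretization, can be illustrated on a one-dimensional toy problem with the adjoint method. Everything concrete here (the linear dynamics, the linear policy, the quadratic cost, SciPy's RK45 solver) is an assumption made for the sketch, not the CTPG implementation itself.

```python
# Hedged sketch of a continuous-time policy gradient via the adjoint method.
# Illustrative assumptions, not Chapter 3's CTPG estimator: dynamics
# x' = -x + u, policy u = theta * x, running cost x^2 + u^2 on [0, T].
# An adaptive RK45 solver chooses the time discretization in both sweeps.
import numpy as np
from scipy.integrate import solve_ivp

T, x0 = 2.0, 1.0

def policy_gradient(theta):
    # Forward sweep: closed-loop dynamics x' = (theta - 1) * x.
    fwd = solve_ivp(lambda t, x: (theta - 1.0) * x, (0.0, T), [x0],
                    dense_output=True, rtol=1e-8, atol=1e-10)
    x_of = lambda t: fwd.sol(t)[0]

    # Backward sweep from t = T to t = 0: adjoint lam and gradient
    # accumulator g, with lam(T) = 0 and g(T) = 0. The accumulator RHS is
    # negated so the reverse-time integral equals the forward-time one.
    def backward(t, s):
        lam, _ = s
        x = x_of(t)
        dlam = -(2.0 * (1.0 + theta ** 2) * x + lam * (theta - 1.0))
        dg = -(2.0 * theta * x ** 2 + lam * x)
        return [dlam, dg]

    bwd = solve_ivp(backward, (T, 0.0), [0.0, 0.0], rtol=1e-8, atol=1e-10)
    return bwd.y[1, -1]  # g(0) = dJ/dtheta

# Sanity check against a central finite difference of the total cost J.
def total_cost(th):
    sol = solve_ivp(lambda t, s: [(th - 1.0) * s[0],
                                  (1.0 + th ** 2) * s[0] ** 2],
                    (0.0, T), [x0, 0.0], rtol=1e-8, atol=1e-10)
    return sol.y[1, -1]

theta, eps = 0.3, 1e-5
print(policy_gradient(theta))
print((total_cost(theta + eps) - total_cost(theta - eps)) / (2 * eps))
```

Because the solver's step sizes adapt to the dynamics, stiff segments get fine steps and smooth segments get coarse ones, which is exactly the efficiency that a fixed-step discretization of the policy gradient gives up.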
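The canonicalization idea in Chapter 4 can be illustrated at its smallest scale: align the hidden units of two one-hidden-layer MLPs by solving a linear assignment problem over unit-similarity scores. This is a generic permutation weight-matching sketch under an assumed tiny architecture; the dissertation's three algorithms, and its ResNet-scale experiments, go well beyond it.

```python
# Hedged sketch: permutation weight matching for one-hidden-layer MLPs.
# Hidden units can be permuted without changing the network's function, so
# we find the permutation of model B's units that best matches model A via
# maximum-weight bipartite matching over unit-similarity scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

def canonicalize(Wa1, ba1, Wa2, Wb1, bb1, Wb2):
    # Similarity between A's unit i and B's unit j, summed over the weights
    # each unit touches: incoming rows, biases, and outgoing columns.
    sim = Wa1 @ Wb1.T + np.outer(ba1, bb1) + Wa2.T @ Wb2
    _, perm = linear_sum_assignment(sim, maximize=True)
    # Re-index B so its unit i is the one matched to A's unit i: permute
    # the rows of layer 1 and the columns of layer 2 consistently.
    return Wb1[perm], bb1[perm], Wb2[:, perm]
```

After canonicalizing B against A, one can linearly interpolate the two weight sets and check whether the loss along the path stays close to the endpoint losses, which is the linear mode connectivity probed in Chapter 4.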

Description

Thesis (Ph.D.)--University of Washington, 2022
