Certifiable Algorithms for Reinforcement Learning: Safety-Critical and Game-Theoretic Perspectives

Authors

Zheng, Liyuan

Abstract

Reinforcement learning has seen significant advances over the last decade in simulated and controlled dynamic systems. These successes have led to interest in deploying learning algorithms in more complex environments, such as safety-critical and multi-agent settings. In those environments, however, existing reinforcement learning algorithms lack important certifications, including satisfaction of safety constraints and convergence guarantees. This thesis introduces certifiable reinforcement learning algorithms that address these gaps.

First, we tackle the problem of finding reinforcement learning policies for control systems with pre-defined state and action constraints. We propose a new approach, termed Vertex Networks, that guarantees safety during both the exploration and execution stages by incorporating the safety constraints into the policy network architecture. Leveraging the geometric property that every point in a convex set can be represented as a convex combination of its vertices, the proposed algorithm first learns the convex combination weights and then uses these weights, along with the pre-calculated vertices, to output an action. The output action is guaranteed to be safe by construction (a minimal sketch of this construction appears below), and numerical examples illustrate that the algorithm outperforms baseline methods.

Second, we address the safe reinforcement learning problem with a known transition kernel and unknown constraints. We work within the Constrained Markov Decision Process (CMDP) framework, a class of stochastic decision problems in which the decision maker must select a policy that satisfies auxiliary cost constraints (see the formulation below). We present an algorithm, C-UCRL, and show that, with probability $1-\delta$, it achieves sub-linear regret $O(T^{\frac{3}{4}}\sqrt{\log(T/\delta)})$ with respect to the reward while satisfying the constraints even during learning. Extending to the unknown transition kernel setting, we prove a lower bound on constraint violation, establishing that some violation is unavoidable for any learning algorithm in a constrained Markov decision process.

Third, we present our work on game-theoretic reinforcement learning, formulating actor-critic algorithms as a Stackelberg game: the interaction between actor and critic is modeled as a two-player general-sum game with a leader-follower structure. We propose a meta-framework for Stackelberg actor-critic algorithms in which the leader follows the total derivative of its objective instead of the usual individual gradient (the update is written out below). From a theoretical standpoint, we develop a policy gradient theorem for the refined update and provide a local convergence guarantee for Stackelberg actor-critic algorithms to a local Stackelberg equilibrium. From an empirical standpoint, we demonstrate via simple examples that the learning dynamics we study mitigate cycling and accelerate convergence compared to the usual gradient dynamics, given the cost structures induced by actor-critic formulations.

Finally, we study the two-player competitive reinforcement learning problem. To let one player take advantage of the competition and benefit from learning against a learning opponent, we adopt the hierarchical Stackelberg game formulation and propose the novel Stackelberg MADDPG algorithm. We also design and open-source new competitive reinforcement learning benchmark tasks and demonstrate the performance and behavior of our algorithm on them.
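To make the Vertex Networks construction concrete, here is a minimal sketch in PyTorch. It is illustrative only, not the thesis implementation: the class name `VertexPolicy`, the layer sizes, and the assumption that the vertices of the convex safe set have been enumerated offline are all ours. A softmax head produces nonnegative weights that sum to one, so the output is a convex combination of the vertices and lies inside the safe set by construction.

```python
import torch
import torch.nn as nn

class VertexPolicy(nn.Module):
    """Illustrative policy head: outputs a convex combination of the
    pre-calculated vertices of a convex safe action set."""

    def __init__(self, state_dim: int, vertices: torch.Tensor):
        # vertices: (num_vertices, action_dim), enumerated offline.
        super().__init__()
        self.register_buffer("vertices", vertices)
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, vertices.shape[0]),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax guarantees weights >= 0 that sum to 1, i.e. a valid
        # convex combination regardless of the network's raw output.
        weights = torch.softmax(self.net(state), dim=-1)
        # The action is a weighted average of safe-set vertices, so it
        # stays inside the convex safe set for any network parameters.
        return weights @ self.vertices
```

Because feasibility is enforced by the architecture rather than by a penalty or a projection step, the guarantee holds at every training iteration, including during exploration.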
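For reference, a CMDP augments the usual expected-reward objective with auxiliary cost constraints. In generic notation (ours, not necessarily the thesis's), the decision maker solves

```latex
\max_{\pi}\;\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} c_i(s_t, a_t)\right] \le \bar{C}_i,
\qquad i = 1, \dots, m,
```

where the $c_i$ are auxiliary cost functions and the $\bar{C}_i$ are the corresponding budgets. In the setting above, the transition kernel is known but the constraints must be learned, and the regret bound holds while keeping these constraints satisfied.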
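The leader's refined update can be written explicitly. With leader parameters $\theta_1$ and loss $L_1$, and follower parameters $\theta_2$ and loss $L_2$ (generic notation again), the total derivative accounts for the follower's local best response via the implicit function theorem:

```latex
\nabla_{\theta_1} L_1
\;-\;
\bigl(\nabla^{2}_{\theta_2 \theta_1} L_2\bigr)^{\!\top}
\bigl(\nabla^{2}_{\theta_2 \theta_2} L_2\bigr)^{-1}
\nabla_{\theta_2} L_1 .
```

The usual individual gradient keeps only the first term; the correction term anticipates how the follower's parameters shift in response to a change in the leader's, which is what mitigates the cycling seen under simultaneous gradient play.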
The contributions of this thesis are steps towards certifiable reinforcement learning algorithms, from both safety and game-theoretic perspectives, in complex environments.

Description

Thesis (Ph.D.)--University of Washington, 2021
