Distributionally Robust Optimization for Reinforcement Learning

Authors

Song, Jun

Abstract

Reinforcement learning (RL) has achieved remarkable success in many domains, including video games, board games, robotics and continuous control tasks. Despite the success and attention that RL has received during the past decades, it struggles with several issues that degrade its performance and lead to suboptimality. In model-based RL, uncertainty in the environment dynamics can significantly deteriorate the learnt agent's ability to recommend good actions. In model-free RL, the learnt agent's performance can be greatly affected by restrictive parametric assumptions on the policy distribution. In this dissertation, our goal is to utilize distributionally robust optimization (DRO) to overcome the above-mentioned limitations of RL, and to develop novel and practical RL algorithms with improved robustness and performance. To achieve this goal, we pursue two main objectives. The first objective is to adopt DRO to add robustness against the uncertainty in the environment dynamics of model-based RL. We propose a new Distributionally Robust Markov Decision Process (DRMDP) framework in which the distribution of the environment dynamics does not have predetermined parametric values; instead, we consider the worst-case probability distribution of the transition probabilities within a decision-dependent ambiguity set. The second objective is to utilize optimistic DRO to develop nonparametric policy optimization methods for model-free RL. Since the learnt policy is not confined to a parametric function class, this opens up the possibility of converging to a better optimum. Following this objective, we propose three different nonparametric policy optimization frameworks, with Kullback–Leibler, Wasserstein and Sinkhorn constraints respectively to control the size of the policy update.
For each framework, we derive the closed-form policy update solution to the corresponding optimistic DRO problem using Lagrangian duality, and propose practical RL algorithms to perform the policy updates. We further improve the sample efficiency of the proposed nonparametric policy optimization frameworks by incorporating human guidance through imitation learning techniques.
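To give a flavour of what such a closed-form update looks like, the sketch below shows the standard KL-regularized case in a tabular setting: maximizing expected action value subject to a KL constraint on the policy yields, via Lagrangian duality, an exponentiated-advantage reweighting of the old policy. This is a minimal illustrative example, not the dissertation's actual algorithm; the function name `kl_policy_update` and the fixed multiplier `eta` are assumptions for the sketch (in practice `eta` would be tuned to satisfy the KL budget).

```python
import numpy as np

def kl_policy_update(pi_old, q_values, eta):
    """Closed-form KL-constrained policy update (illustrative sketch).

    Per state, solves  max_pi  E_pi[Q] - eta * KL(pi || pi_old),
    whose Lagrangian-duality solution is
        pi_new(a|s) ∝ pi_old(a|s) * exp(Q(s, a) / eta).
    """
    logits = np.log(pi_old) + q_values / eta
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=-1, keepdims=True)

# Toy example: 2 states, 3 actions, uniform old policy.
pi_old = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 2.0, 0.5],
              [0.0, 0.1, 0.2]])
pi_new = kl_policy_update(pi_old, q, eta=1.0)
```

Note that the update is nonparametric: each state's distribution is reweighted directly, without committing to any parametric policy family, which is the property the dissertation's frameworks exploit.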

Description

Thesis (Ph.D.)--University of Washington, 2024
