Distributionally Robust Optimization for Reinforcement Learning

dc.contributor.advisor: Zhao, Chaoyue
dc.contributor.author: Song, Jun
dc.date.accessioned: 2024-04-26T23:21:19Z
dc.date.available: 2024-04-26T23:21:19Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Reinforcement learning (RL) has achieved remarkable success in many domains, including video games, board games, robotics, and continuous control tasks. Despite the success and attention that RL has received over the past decades, it still struggles with several issues that degrade its performance and lead to suboptimality. In model-based RL, uncertainty in the environment dynamics can significantly degrade the learned agent's ability to recommend good actions, while in model-free RL, the learned agent's performance can be greatly limited by restrictive parametric assumptions on the policy distribution. In this dissertation, our goal is to utilize distributionally robust optimization (DRO) to overcome these limitations of RL and to develop novel, practical RL algorithms with improved robustness and performance. To achieve this goal, we pursue two main objectives. The first objective is to adopt DRO to make model-based RL robust to uncertainty in the environment dynamics. We propose a new Distributionally Robust Markov Decision Process (DRMDP) framework in which the environment dynamics are not assumed to take predetermined parametric values; instead, we optimize against the worst-case probability distribution of the transition probabilities within a decision-dependent ambiguity set. The second objective is to utilize optimistic DRO to develop nonparametric policy optimization methods for model-free RL. Because the learned policy is not confined to a parametric function class, these methods open up the possibility of converging to a better optimum. Following this objective, we propose three nonparametric policy optimization frameworks, with Kullback–Leibler, Wasserstein, and Sinkhorn constraints, respectively, to control the size of the policy update. For each framework, we derive the closed-form policy update solution to the corresponding optimistic DRO problem using Lagrangian duality and propose practical RL algorithms to perform the policy updates. We further improve the sample efficiency of the proposed nonparametric policy optimization frameworks by incorporating human guidance through imitation learning techniques. (A brief notational sketch of these formulations follows the metadata record below.)
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Song_washington_0250E_26551.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51371
dc.language.iso: en_US
dc.rights: CC BY-NC-ND
dc.subject: Distributionally Robust Optimization
dc.subject: Markov Decision Process
dc.subject: Reinforcement Learning
dc.subject: Wasserstein Metric
dc.subject: Industrial engineering
dc.subject: Computer science
dc.subject: Mathematics
dc.subject.other: Industrial engineering
dc.title: Distributionally Robust Optimization for Reinforcement Learning
dc.type: Thesis
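
As a reading aid, here is a minimal LaTeX sketch of the two formulations summarized in the abstract above. It is a hedged illustration, not verbatim from the dissertation: the ambiguity set \mathcal{P}(\pi), advantage function A^{\pi_k}, trust-region radius \delta, and dual multiplier \lambda are assumed standard notation, and the exact formulations in the thesis may differ.

% Hedged sketch; notation is assumed, not taken from the dissertation.

% DRMDP objective: maximize the worst-case expected discounted return,
% with the transition kernel P ranging over a decision-dependent
% ambiguity set \mathcal{P}(\pi).
\[
  \max_{\pi} \; \min_{P \in \mathcal{P}(\pi)} \;
  \mathbb{E}_{P,\pi}\!\Bigl[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Bigr]
\]

% Nonparametric KL-constrained update: maximize the expected advantage
% subject to a KL trust region around the current policy \pi_k.
\[
  \max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[ A^{\pi_k}(s,a) \bigr]
  \quad \text{s.t.} \quad
  D_{\mathrm{KL}}\bigl( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \bigr) \le \delta
\]

% Lagrangian duality gives the exponentiated closed-form update
\[
  \pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\,
  \exp\!\Bigl( \tfrac{1}{\lambda}\, A^{\pi_k}(s,a) \Bigr),
\]
% where \lambda > 0 is the multiplier of the KL constraint; note that no
% parametric form is imposed on \pi_{k+1}.

In the Wasserstein and Sinkhorn variants mentioned in the abstract, the KL constraint above is replaced by an optimal-transport distance between \pi and \pi_k, and the closed-form updates follow from the corresponding Lagrangian duals.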

Files

Original bundle

Name: Song_washington_0250E_26551.pdf
Size: 3.97 MB
Format: Adobe Portable Document Format