Title: Reward Shaping in Single and Multi-Agent Deep Reinforcement Learning
Author: Xiao, Baicen
Advisor: Poovendran, Radha
Description: Thesis (Ph.D.)--University of Washington, 2021
Date issued: 2021
Date deposited: 2022-01-26
File: Xiao_washington_0250E_23615.pdf
Handle: http://hdl.handle.net/1773/48178
Format: application/pdf
Language: en-US
Type: Thesis
Keywords: Deep Learning; Reinforcement Learning; Reward Shaping; Electrical engineering

Abstract:

Rapid strides made in the development of computing infrastructure over the last ten years have played a crucial role in significantly advancing the state of the art in reinforcement learning. These developments have enabled the successful application of reinforcement learning algorithms to complex domains, including computer games, molecular design, and robotics. However, it remains difficult for a reinforcement learning agent to learn to complete tasks effectively, especially when the reward provided by the environment is sparse or significantly delayed. In these situations, the agent does not receive immediate feedback on its actions; this is called the credit assignment problem. In many scenarios, it is hard to design a reward scheme that is dense, i.e., one that provides frequent and timely rewards to the agent at each intermediate time step. It is therefore critical to develop tools that can efficiently guide a reinforcement learning agent toward promising solutions in environments with sparse rewards.

Reward shaping refers to a class of credit assignment methods that augment the reward from the environment with an additional reward signal, with the goal of giving immediate feedback to the agent. The theme of this thesis is to integrate reward shaping into deep reinforcement learning algorithms to i) increase the speed of learning, and ii) improve the performance of the learned policies by allowing the agent to obtain higher rewards. We consider three different types of information that can be utilized to perform reward shaping in deep reinforcement learning algorithms.

First, we focus on information in the form of potential functions. The difference between the values of a potential function at any two points is independent of the path taken in traveling from one point to the other. Potential-based shaping advice refers to a class of methods that uses the difference of potential functions as the shaping reward. We develop algorithms that impart potential-based shaping advice to policy gradient algorithms in both single-agent and multi-agent reinforcement learning.
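For reference, the canonical potential-based form (due to Ng, Harada, and Russell; the thesis builds its shaping advice schemes on this idea, possibly with variants such as action-dependent potentials) defines the shaping reward from a potential function Φ over states:

```latex
% Potential-based shaping: \Phi maps states to reals, \gamma is the discount factor.
% The shaped reward \tilde{r} augments the environment reward r with F.
F(s, s') = \gamma\,\Phi(s') - \Phi(s), \qquad
\tilde{r}(s, a, s') = r(s, a, s') + F(s, s')
```

Because these terms telescope along any trajectory, the shaped return differs from the original return only by -Φ(s_0), a quantity independent of the actions taken, so the optimal policies of the underlying task are preserved.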
In some scenarios, domain knowledge may not be available in the form of potential functions. Instead, human operators may be available to provide feedback during training. We therefore seek to effectively integrate feedback signals supplied by a human operator into deep reinforcement learning algorithms operating in high-dimensional state spaces. We propose FRESH (Feedback-based REward SHaping), a framework designed to transform human feedback into a shaping reward that is added to the environment reward (a rough sketch of this idea follows the abstract).

Potential-based advice and human expert feedback, however, may not be available in scenarios where the number of agents increases or the environment is complex. For settings without prior domain knowledge, we propose an algorithm that automatically performs effective long-term temporal credit assignment based on the interaction history of the agents. To solve the temporal credit assignment problem in multi-agent environments with delayed rewards, it is critical to identify the relative importance of i) each agent's state at any single time step (the agent dimension), and ii) states along the length of an episode (the temporal dimension). We introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these challenges (a sketch of the attention idea also follows the abstract).

In each case, our experiments demonstrate that the reward shaping methods developed in this thesis help state-of-the-art deep reinforcement learning algorithms obtain higher average rewards and learn faster.
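As a rough illustration of the feedback-shaping idea behind FRESH, the sketch below fits a small network to logged human feedback on (state, action) pairs and adds its prediction, scaled by a coefficient, to the environment reward. The names `FeedbackModel`, `fit_feedback`, and `shaped_reward`, the architecture, the regression loss, and the `feedback_weight` coefficient are illustrative assumptions, not the exact design in the thesis.

```python
# A minimal sketch of feedback-based reward shaping (not the exact FRESH
# design): a model learned from human feedback augments the env reward.
import torch
import torch.nn as nn

class FeedbackModel(nn.Module):
    """Predicts scalar human feedback in [-1, 1] for a (state, action) pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh(),  # feedback assumed to be "good" (+1) / "bad" (-1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def fit_feedback(model, optimizer, states, actions, feedback):
    """One regression step on logged (state, action, human feedback) triples."""
    loss = nn.functional.mse_loss(model(states, actions), feedback)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def shaped_reward(env_reward, state, action, model, feedback_weight=0.1):
    """Environment reward augmented with predicted human feedback."""
    with torch.no_grad():
        return env_reward + feedback_weight * model(state, action)
```

The agent then trains on `shaped_reward` instead of the raw environment reward, so sparse human feedback can still influence every update.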
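Similarly, a minimal sketch of the agent-temporal attention idea in AREL: attention is applied across agents within each time step and then across time steps, producing weights that redistribute a single episodic reward over the episode. The `AgentTemporalRedistributor` module, its layer sizes, the mean pooling over agents, and the softmax redistribution are assumptions for illustration, not the architecture used in the thesis.

```python
# Illustrative sketch (not AREL's actual architecture): redistribute an
# episodic team reward over time steps via agent and temporal attention.
import torch
import torch.nn as nn

class AgentTemporalRedistributor(nn.Module):
    def __init__(self, obs_dim: int, embed_dim: int = 32, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)
        # Attention across agents within each time step (agent dimension).
        self.agent_attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        # Attention across time steps of the episode (temporal dimension).
        self.time_attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, obs: torch.Tensor, episode_reward: torch.Tensor):
        # obs: (T, N, obs_dim) -- T time steps, N agents.
        x = self.embed(obs)                    # (T, N, E)
        x, _ = self.agent_attn(x, x, x)        # mix information across agents
        x = x.mean(dim=1).unsqueeze(0)         # (1, T, E), pooled over agents
        x, _ = self.time_attn(x, x, x)         # mix information across time
        # Softmax weights sum to 1, so per-step rewards sum to the
        # episodic reward being redistributed.
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)  # (1, T)
        return episode_reward * weights.squeeze(0)                  # (T,)
```

The resulting dense per-step rewards can then be consumed by a standard multi-agent reinforcement learning algorithm in place of the delayed episodic reward.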