Information-directed policy sampling for episodic Bayesian Markov decision processes
Author
Diaz, Victoria
Abstract
The research objective of this dissertation is to apply information-theoretic methods to design provably efficient approximate solution algorithms for Markov decision processes (MDPs), partially observable MDPs (POMDPs), and hierarchical MDPs under incomplete information. We consider these problems within an episodic Bayesian framework, where the decision-maker interacts with a stochastic system repeatedly over T episodes comprising N stages each. The decision-maker only knows that the true parameters describing the stochastic system take values from a particular finite set. The decision-maker begins the first episode with a prior probabilistic belief about the true parameters of the system, and updates this belief at the end of each episode based on observed events. The decision-maker wishes to maximize the expected total reward earned over all episodes under such incomplete information. The challenge of balancing exploration versus exploitation is at the heart of this dissertation. The decision-maker should execute policies that provide information about the true parameters of the system (exploration), but should also exploit this acquired knowledge to implement policies that earn high rewards (exploitation). Exact methods that attempt to balance this trade-off are computationally intractable due to the curse of dimensionality. Approximate solution methods are thus desired, but are often available only as heuristics with poor regret bounds or none at all. To overcome these limitations, this dissertation proposes a framework whereby, in each episode, the decision-maker executes a policy sampled from a probability mass function (pmf) that minimizes a so-called convex information ratio. The numerator of this information ratio equals the squared regret incurred, and the denominator equals the information gained about the true parameters of the system, by executing such a policy. Minimizing this ratio is thus a natural way to balance the exploration-exploitation trade-off.
We call the resulting framework information-directed policy sampling (IDPS). This idea is motivated by the recent theoretical and computational success of a paradigm called information-directed sampling in balancing this trade-off in the special case of multi-armed bandit problems. However, the dependence of future states on current state-action pairs poses unique technical hurdles when generalizing this idea to Markovian systems. We tackle this challenge by introducing a new way to define the episodic regret and information gain using pmfs over the set of policies that are optimal under distinct system parameters, instead of the set of all policies. We derive regret bounds that do not depend on the state-space, action-space, or observation-space cardinalities. Instead, our regret bounds scale with the number of episodes T, the number of possible parameter values, the number of stages N, and the entropy of the prior belief. The proposed algorithms are compared computationally against a state-of-the-art approach called Posterior Sampling (PS) on three applications: queuing control, machine repair, and dynamic pricing. The thesis is organized as follows. The first chapter investigates MDPs where the decision-maker has incomplete information about the state transition probabilities and single-stage rewards. A regret bound for IDPS is derived, and numerical experiments show that IDPS outperforms PS on all three applications. The second chapter studies POMDPs where the decision-maker has incomplete information about the state transition probabilities and the observation probabilities. A regret bound for IDPS is derived. The third chapter relates to MDPs with a hierarchical incomplete information framework. The upper level of this hierarchy captures ambiguity about which structural model characterizes the true system dynamics, and the lower level corresponds to ambiguity regarding the true parameters of these potential models.
For instance, the decision-maker may not know whether the true demand model is Poisson or Binomial. Further, if the true model is Poisson, then the decision-maker may not know its mean. Three variations of IDPS are introduced, and a regret bound for one such variation is derived. Computational experiments consider a hierarchical variant of the dynamic pricing application. Future research could focus on extending the framework and theoretical analyses in this dissertation to other settings such as indefinite-horizon MDPs, continuous-time MDPs, semi-Markov decision processes, and multi-player stochastic games.
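The per-episode step described in the abstract (update the belief over a finite parameter set, then sample a policy from a pmf that minimizes squared expected regret divided by expected information gain) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The function names, the per-policy regret and information-gain numbers, and the grid search are all hypothetical, and the dissertation defines these quantities in its own way for Markovian systems. The search over pmfs supported on at most two policies borrows a known property of this kind of ratio from the bandit setting and is assumed here to be adequate.

```python
import itertools
import numpy as np


def update_belief(belief, likelihoods):
    """Bayes update over a finite parameter set.

    likelihoods[k] is the probability of the episode's observed events
    under the k-th candidate parameter value (assumed to be given).
    """
    posterior = np.asarray(belief, dtype=float) * np.asarray(likelihoods, dtype=float)
    return posterior / posterior.sum()


def minimize_information_ratio(regret, info_gain, grid_size=1001):
    """Approximately minimize (pmf @ regret)**2 / (pmf @ info_gain).

    regret[i] and info_gain[i] are the expected regret and expected
    information gain of candidate policy i (hypothetical inputs).
    Only pmfs supported on at most two policies are searched, with the
    mixing weight restricted to a uniform grid on [0, 1].
    """
    regret = np.asarray(regret, dtype=float)
    info_gain = np.asarray(info_gain, dtype=float)
    k = len(regret)
    weights = np.linspace(0.0, 1.0, grid_size)
    best_ratio, best_pmf = np.inf, None
    for i, j in itertools.combinations_with_replacement(range(k), 2):
        mixed_regret = weights * regret[i] + (1.0 - weights) * regret[j]
        mixed_gain = weights * info_gain[i] + (1.0 - weights) * info_gain[j]
        with np.errstate(divide="ignore", invalid="ignore"):
            # Mixtures with zero information gain are treated as infeasible.
            ratios = np.where(mixed_gain > 0, mixed_regret**2 / mixed_gain, np.inf)
        idx = int(np.argmin(ratios))
        if ratios[idx] < best_ratio:
            best_ratio = float(ratios[idx])
            best_pmf = np.zeros(k)
            best_pmf[i] += weights[idx]
            best_pmf[j] += 1.0 - weights[idx]
    return best_pmf, best_ratio


# Hypothetical numbers for three candidate policies.
regret = [0.1, 0.5, 1.0]
info_gain = [0.01, 0.4, 0.5]
pmf, ratio = minimize_information_ratio(regret, info_gain)

# Sample the policy index to execute in this episode.
rng = np.random.default_rng(0)
policy_index = int(rng.choice(len(pmf), p=pmf))

# After the episode, update the belief with hypothetical likelihoods.
belief = update_belief([0.5, 0.5], [0.2, 0.6])
```

With these made-up numbers, the selected pmf mixes a low-regret, low-information policy with a more informative one, which is the kind of randomization between exploitation and exploration that the information ratio is meant to produce.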
Description
Thesis (Ph.D.)--University of Washington, 2022
