Information theoretic learning methods for Markov decision processes with parametric uncertainty
Markov decision processes (MDPs) model a class of stochastic sequential decision problems with applications in engineering, medicine, and business analytics. There is considerable interest in the literature in MDPs with imperfect information, where the search for well-performing policies faces several challenges. There is no rigorous, universally accepted optimality criterion. The decision-maker suffers from the curse of dimensionality. Finding good policies requires carefully balancing exploration to acquire information against exploitation of that information to earn high rewards. This dissertation contributes to this area by building a rigorous framework rooted in information theory for solving MDPs with model uncertainty. In the first part, the value of a parameter that characterizes the transition probabilities is unknown to the decision-maker. Information Directed Policy Sampling (IDPS) is proposed to manage the exploration-exploitation trade-off. A generalization of Hoeffding's inequality is employed to derive a regret bound. Numerical results on a stylized example, an auction-design problem, and a response-guided dosing problem are discussed. In the second part, uncertainty in the transition probabilities arises at two levels. The top level corresponds to ambiguity about the system model. The bottom level is rooted in the unknown parameter values for each possible model. Prior-update formulas are derived using a hierarchical Bayesian framework and incorporated into two learning algorithms: Thompson Sampling and a hierarchical extension of IDPS. Analytical performance bounds are developed. Numerical results on the response-guided dosing problem are presented. The third part extends the above to partially observable Markov decision processes (POMDPs). A connection between POMDPs and the first two chapters is exploited to devise algorithms and provide analytical performance guarantees in three cases: a) uncertainty in the transition probabilities; b) uncertainty in the measurement outcome probabilities; and c) uncertainty in both. Numerical results on partially observed response-guided dosing are included. The fourth part develops a formal information-theoretic framework inspired by stochastic thermodynamics. It builds on the idea that information is physical. An explicit link between information entropy and the stochastic dynamics of a system coupled to an environment is developed from fundamental principles. Unlike the heuristic idea of the information ratio, this provides an optimization program that is built from system dynamics, problem objective, and feedback from observations. To the best of my knowledge, this is the first framework that is entirely grounded in system and informational dynamics without relying on heuristic criteria.
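
The abstract does not state the prior-update formulas or the IDPS procedure, so the following is only a minimal sketch of the general idea behind learning an unknown transition parameter: a Bayes-rule update over a finite set of candidate parameter values, followed by a Thompson Sampling draw from the posterior. The two-state example, the candidate values, and the function names `update_posterior` and `thompson_sample` are all hypothetical and are not taken from the dissertation.

```python
import random


def update_posterior(prior, likelihoods):
    """Bayes' rule over a finite set of candidate parameter values.

    prior[i]       : current probability assigned to candidate i
    likelihoods[i] : P(observed transition | candidate i)
    """
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under every candidate")
    return [u / total for u in unnormalized]


def thompson_sample(posterior, rng):
    """Draw one candidate index with probability equal to its posterior mass."""
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]


# Hypothetical setup: each candidate theta is P(next state = 1) for a fixed
# state-action pair; the true value is unknown to the decision-maker.
candidates = [0.2, 0.5, 0.8]
posterior = [1 / 3, 1 / 3, 1 / 3]  # uniform prior (an assumption)
true_theta = 0.8
rng = random.Random(0)

for _ in range(50):
    # Simulate one observed transition from the true (hidden) model.
    next_state = 1 if rng.random() < true_theta else 0
    # Likelihood of that transition under each candidate parameter.
    likelihoods = [t if next_state == 1 else 1.0 - t for t in candidates]
    posterior = update_posterior(posterior, likelihoods)

# Thompson Sampling step: sample a parameter, then (in a full algorithm)
# act optimally for the MDP defined by the sampled parameter.
sampled_theta = candidates[thompson_sample(posterior, rng)]
```

In a complete algorithm, the sampled parameter would define an MDP that is solved for its optimal policy before the next observation; that planning step is omitted here.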