Percentile and inverse optimization in Markov decision processes with extensions to convex programs

Ghatrani, Zahra

Files

Ghatrani_washington_0250E_23672.pdf (844.28 KB)

Date

2022-01-26

Abstract

Infinite-horizon stationary Markov decision processes (MDPs) have been studied extensively in the literature. Over the last sixty years, they have found applications in a broad range of areas such as healthcare, telecommunications, transportation, revenue management, supply chain and inventory management, scheduling, resource allocation, autonomous systems, and reinforcement learning. An MDP is described as follows. At the beginning of each time-step, a decision-maker observes the state of a stochastic system. The decision-maker then chooses an action. The system probabilistically evolves into a new state and the decision-maker earns a reward. The transition probability and the reward both depend on the current state and the action chosen therein. The decision-maker's goal is to choose actions such that the expected total discounted reward over an infinite horizon is maximized.

It is often assumed in the literature that the transition probabilities and the rewards are known to the decision-maker. In practice, however, these are estimated subject to errors. The question then arises as to how the decision-maker can incorporate its incomplete knowledge about these parameters into the decision-making process. At least three different approaches have been proposed: robust optimization, inverse optimization, and percentile optimization. This dissertation makes methodological contributions to two of these three: percentile and inverse optimization.

Percentile optimization in MDPs accounts for uncertainty in reward parameters by choosing decisions that maximize the $\beta$-percentile of the expected total discounted reward over an infinite horizon. This approach is similar to chance-constrained optimization. It turns out that, when rewards are multivariate Gaussian, the percentile optimization problem in MDPs can be reformulated as a second-order cone program (SOCP). Multi-armed bandit (MAB) problems are perhaps the most studied special case of MDPs.
Unfortunately, an as-is application of the existing percentile optimization methodology is intractable for MABs. In particular, the resulting SOCP suffers from the curse of dimensionality because its size is exponential in the number of arms.

The idea in inverse optimization is to recover implied parameter values from observed decisions. In the context of MDPs, this approach has been applied to recover reward values that render observed decisions optimal. It is known that this results in a linear program (LP) that can be solved efficiently. A counterpart of this approach is not available for imputing transition probabilities. The challenge is that, since transition probabilities appear on the left-hand side as constraint coefficients in an LP formulation of an infinite-horizon MDP, the inverse problem turns out to be nonconvex bilinear. More generally, only a few studies in the literature focus on applying the inverse optimization framework to LPs or convex programs with unknown constraint parameters.

The research objective of this dissertation is to apply convex optimization methods to efficiently compute approximate solutions of (i) percentile optimization problems in MABs, and more generally, in weakly coupled MDPs under Gaussian rewards; (ii) inverse problems in MDPs with unknown transition probabilities; and (iii) inverse semidefinite programs (SDPs). The dissertation is organized as follows.

Percentile optimization in MAB problems: the first chapter focuses on MAB problems whose traditional version can be described as follows. At each time-step, a decision-maker selects one arm from a finite set, after observing the states of all arms. A reward is earned from this arm and its state evolves stochastically. No reward is earned from other arms, and their states do not change. The goal is to determine an arm-pulling policy that maximizes the expected total discounted reward over an infinite horizon.
The chapter considers the more challenging case where rewards are multivariate Gaussian with possible correlations across states, to account for estimation errors. This is motivated by recent work on percentile optimization in MDPs. We demonstrate that, when applied to MABs, this yields an intractable SOCP with size exponential in the number of arms. The chapter proposes a Lagrangian relaxation method to break this curse of dimensionality. This relaxation dualizes the restriction that exactly one arm must be played at each time-step. The optimal value of the relaxed problem provides an upper bound on the optimal value of the exact percentile problem. Moreover, the relaxation achieves a decomposition across arms, which exponentially reduces the computational complexity. The chapter then applies convex strong duality to formulate the problem of finding the tightest upper bound (and the corresponding best Lagrange multiplier) as a tractable SOCP. We propose three approaches to recover feasible arm-pulling decisions during run-time from an off-line optimal solution of this SOCP. Numerical experiments suggest that one of these three methods is more effective than the other two. This methodology is also extended to a broader class of problems called weakly coupled MDPs. There, we propose four methods to recover run-time decisions from an off-line optimal solution of an SOCP. Our numerical results suggest that three of these methods perform better than the fourth one, and that one of these three methods seems to work better for larger problems than the other two.

Inverse MDPs with unknown transition probabilities: the second chapter considers two variants of this problem. In the first variant, the decision-maker wonders whether there exist transition probabilities and corresponding decisions that would attain a given expected total discounted reward over an infinite horizon (the so-called value function).
An easy-to-verify necessary and sufficient condition for this existence is derived. The chapter demonstrates that when this condition is met, the requisite transition probabilities and decisions can be imputed by solving a feasibility LP. These ideas are then extended to the case when the decision-maker wishes to render the given value function optimal. The chapter then turns to the more difficult problem of imputing transition probabilities that make given decisions optimal. LP strong duality is applied to this problem to derive a nonconvex bilinear program. Tailored versions of two heuristics that exploit the structure of this bilinear program are proposed. The first one is rooted in a so-called convex-concave procedure (CCP) for a class of problems called "difference of convex" programs. The second one is called sequential linear programming (SLP). The performance of these two methods is compared via numerical experiments against an exact global optimization method based on generalized Benders decomposition. Computational experiments on randomly generated inverse generic MDP problems reveal that SLP outperforms the other two methods in both runtime and objective values. Further insights into SLP's performance are derived via numerical experiments on inverse inventory control, equipment replacement, and multi-armed bandit problems.

Inverse semidefinite programs with unknown constraint parameters: the third chapter focuses on extending the methods discussed in the second chapter to the broader class of convex optimization problems called semidefinite programs (SDPs). An inverse optimization methodology for SDPs with unknown constraint parameters is currently not available. As in the second chapter, when constraint parameters are unknown, the resulting inverse SDP problem is nonconvex bilinear. This chapter focuses on six variants of this inverse problem constructed based on which parameters and decision variables are known.
In each case, we will identify when the inverse problem is trivial to solve and when it is not. We will show that in one variant, the inverse SDP problem reduces to solving a group of SDPs. In two of the other variants, we provide conditions under which the inverse problem can be reduced to a tractable SDP via a variable transformation. In all other cases, we apply tailored versions of the heuristics proposed in the second chapter to obtain an approximate solution. Another heuristic called Alternate Convex Search (ACS) is also implemented in some cases. The performance of these methods is compared via numerical experiments.
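
As a rough illustration of the Gaussian percentile idea summarized in this abstract (a sketch under stated assumptions, not code from the dissertation): for a fixed policy with discounted state-action occupancy vector x and rewards r distributed as N(mu, Sigma), the total discounted reward r^T x is Gaussian with mean mu^T x and variance x^T Sigma x. The largest level y that is exceeded with probability at least eps is therefore mu^T x - Phi^{-1}(eps) * sqrt(x^T Sigma x), and for eps >= 1/2 the requirement that this quantity be at least y is a second-order cone constraint in x. The function name, the eps convention (which may differ from the dissertation's $\beta$), and all numbers below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_percentile_value(x, mu, sigma, eps):
    """Largest y with P(r^T x >= y) >= eps when r ~ N(mu, sigma).

    x     : occupancy vector of a fixed policy (assumed given)
    mu    : mean reward vector
    sigma : reward covariance matrix as a list of lists
    eps   : confidence level strictly between 0 and 1
    """
    n = len(x)
    mean = sum(mu[i] * x[i] for i in range(n))
    var = sum(x[i] * sigma[i][j] * x[j] for i in range(n) for j in range(n))
    z = NormalDist().inv_cdf(eps)  # standard normal quantile
    return mean - z * sqrt(max(var, 0.0))


# Hypothetical two-entry example; the entries of x sum to 1 / (1 - 0.8) = 5,
# as a discounted occupancy vector with discount factor 0.8 would.
x = [2.0, 3.0]
mu = [1.0, 0.5]
sigma = [[0.04, 0.01], [0.01, 0.09]]

risk_neutral = gaussian_percentile_value(x, mu, sigma, 0.5)   # equals mu^T x
cautious = gaussian_percentile_value(x, mu, sigma, 0.95)      # penalized by risk
print(risk_neutral, cautious)
```

At eps = 0.5 the normal quantile is zero, so the value reduces to the expected reward; raising eps subtracts a larger multiple of the standard deviation. Optimizing over x rather than evaluating a fixed x is what leads to the SOCP mentioned in the abstract.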