Mixed Integer Quadratic Optimization for Learning Directed Acyclic Graphs from Continuous Data
The study of probabilistic graphical models (PGMs) is an essential topic in statistics and machine learning. Bayesian networks (BNs), arguably one of the most central classes of PGMs, are frequently used to represent causal relations among a set of random variables in complex systems. A Bayesian network (BN) is a PGM that consists of a labeled directed acyclic graph (DAG) in which the vertices correspond to random variables (nodes) and the edge set prescribes a decomposition of the joint probability distribution of the nodes, such that the value of any node is a probabilistic function of the values of its parents in the DAG. The edge set encodes Markov conditions on the nodes: each node is conditionally independent of its non-descendants given its parents.

While statistical properties of BNs learned from continuous data have been extensively studied, the development of efficient computational tools for learning an optimal DAG remains an open challenge. The goal is to learn a DAG structure that maximizes a score function; one such score is the posterior probability of the DAG structure given the data. Learning DAGs from observational data is computationally difficult because the number of possible DAGs scales super-exponentially with the number of nodes.

In this work, we propose novel discrete optimization formulations for learning DAGs over continuous variables. Learning a DAG from continuous data can be cast as a mixed-integer quadratic program (MIQP) whose objective is the penalized negative log-likelihood (PNL) with an L0 regularization penalty, subject to linear constraints. There are two key challenges: (i) imposing a set of constraints that removes cycles from a directed graph, and (ii) enforcing tight bounds on the semi-continuous optimization variables corresponding to the arc weights.

We tackle the first challenge by presenting a way to remove cycles that results in a new MIQP formulation with linear constraints, referred to as the layered network (LN) formulation. We establish that LN is a compact formulation and that, under a mild condition, the objective value of its continuous relaxation is as tight as that of stronger but larger formulations. An additional benefit of the LN formulation is that it effectively incorporates prior structural knowledge (a super-structure) to reduce the set of candidate DAGs. Computational results indicate that the proposed formulation outperforms existing mathematical formulations and scales better than available algorithms that solve the same problem with only L1 regularization, especially in the presence of a sparse super-structure.

For the second challenge, the semi-continuous arc-weight variables are commonly modeled with standard "big-M" constraints in the associated mixed-integer program. However, this strategy leads to a poor continuous relaxation because there is no natural upper bound on the arc weights. To circumvent this deficiency, we present a mixed-integer second-order cone program (MISOCP), which has tighter continuous relaxation bounds than the existing formulations based on big-M constraints, including the LN formulation. We show the promising performance of the MISOCP in reducing the optimality gap compared with the best existing optimization formulations. The performance of each formulation depends on the size and tightness of its continuous relaxation.
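To make the factorization described above concrete (notation ours): writing Pa(j) for the parent set of node j in the DAG, the joint distribution of nodes X_1, ..., X_m decomposes as

    P(X_1, \dots, X_m) = \prod_{j=1}^{m} P\big(X_j \mid X_{\mathrm{Pa}(j)}\big),

and the Markov condition asserts that each X_j is conditionally independent of its non-descendants given X_{\mathrm{Pa}(j)}.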
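In the linear Gaussian setting, the PNL objective can be sketched as follows (a schematic under standard assumptions; the exact formulation in the full text may differ). With data matrix X \in \mathbb{R}^{n \times m}, arc weights \beta_{ij}, and binary arc indicators g_{ij}:

    \min_{\beta,\, g} \;\; \frac{1}{2n} \sum_{j=1}^{m} \Big\lVert X_j - \sum_{i \neq j} \beta_{ij} X_i \Big\rVert_2^2 \; + \; \lambda \sum_{i \neq j} g_{ij},

subject to the graph selected by g being acyclic and \beta_{ij} = 0 whenever g_{ij} = 0; the term \sum_{i \neq j} g_{ij} is the L0 penalty on the arc weights.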
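One standard layered-network encoding of acyclicity, in the spirit of the LN formulation (our sketch; the precise constraints in the full text may differ), assigns each node a continuous layer variable \psi_j and forces every selected arc to climb strictly in layers:

    \psi_j - \psi_i \geq 1 - m(1 - g_{ij}) \quad \text{for all } i \neq j, \qquad 1 \leq \psi_j \leq m.

When g_{ij} = 1 the constraint reads \psi_j \geq \psi_i + 1, so a directed cycle would require the layers to increase strictly all the way around the cycle, which is impossible; hence only DAGs are feasible. These are linear constraints with a single continuous variable per node, which keeps the formulation compact, and a super-structure is incorporated simply by fixing g_{ij} = 0 for arcs outside it.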
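The two treatments of the semi-continuous arc weights can be contrasted as follows (standard constructions; the MISOCP in the full text may differ in details). The big-M link is

    -M g_{ij} \leq \beta_{ij} \leq M g_{ij},

which is weak precisely because no natural value of M bounds the arc weights. A perspective-style conic alternative introduces an auxiliary variable s_{ij} \geq 0 that carries the quadratic contribution of \beta_{ij} in the objective and imposes the rotated second-order cone constraint

    \beta_{ij}^2 \leq s_{ij}\, g_{ij},

which forces \beta_{ij} = 0 whenever g_{ij} = 0 without any explicit bound and yields a tighter continuous relaxation.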
This work highlights that the best-performing formulation applies the LN constraints to remove cycles, keeping the optimization problem small, while using conic constraints to tighten the relaxation of the semi-continuous variables.
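A minimal sketch of such a model follows, assuming a small synthetic instance and the hypothetical parameter values below. For brevity it pairs the LN acyclicity constraints with big-M links (a full implementation would replace those links with the conic strengthening sketched above), and it requires a solver capable of mixed-integer quadratic programming:

    import cvxpy as cp
    import numpy as np

    # Hypothetical synthetic instance: n samples, m nodes.
    n, m = 100, 5
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, m))

    B = cp.Variable((m, m))                # arc weights beta_ij
    g = cp.Variable((m, m), boolean=True)  # arc indicators g_ij
    psi = cp.Variable(m)                   # node layers for the LN constraints
    M = 10.0                               # big-M bound (assumption; no natural bound exists)
    lam = 0.1                              # L0 penalty weight (assumption)

    cons = [cp.diag(B) == 0, cp.diag(g) == 0, psi >= 1, psi <= m]
    # Semi-continuous link: beta_ij = 0 unless arc i->j is selected.
    cons += [B <= M * g, B >= -M * g]
    # Layered-network acyclicity: a selected arc i->j forces psi_j >= psi_i + 1.
    for i in range(m):
        for j in range(m):
            if i != j:
                cons.append(psi[j] - psi[i] >= 1 - m * (1 - g[i, j]))

    # Penalized negative log-likelihood (linear Gaussian) plus L0 penalty.
    obj = cp.Minimize(cp.sum_squares(X - X @ B) / (2 * n) + lam * cp.sum(g))
    prob = cp.Problem(obj, cons)
    # prob.solve(solver=cp.GUROBI)  # needs a mixed-integer QP solver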