Building and Querying Probabilistic Models for Open World Database Systems
Abstract
A fundamental assumption of traditional database management systems is that the database contains all information necessary to answer a query; i.e., the database contains the entire population of data. However, with the increasing availability of public data samples (e.g., government data) and easy-to-use scientific programming languages (e.g., Python), data scientists are turning to samples to analyze and understand the populations they represent. Because databases do not treat stored data as samples, data scientists are forced to use tools outside of the database for their data processing needs. For database management systems to accommodate this growing group of users, they need to adopt the open world assumption: tuples not in the database may still exist. In this dissertation, we answer two main research questions on how to build an open world database system that approximately answers queries as if they were issued over the entire population. The first question is: in an ideal setting where we can choose what statistics to gather about a population, how can we build a probabilistic model of the population that embodies the open world assumption, i.e., that assigns every possible tuple some nonzero probability of existing? Using the Principle of Maximum Entropy, we built EntropyDB, a prototype database system that constructs such a probabilistic model for approximate query processing. The second question is: when the database has access only to a sample of the population and some population aggregate information, how can we automatically remove arbitrary selection bias so that users can accurately answer population queries? We implement this automatic debiasing in Themis, the first open world database system that uses a priori population aggregate information to rebalance sample data. While important future work remains in the field of open world database systems, this dissertation presents the first step towards their realization.
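To make the idea of rebalancing a sample against population aggregates concrete, the following is a minimal sketch, not the method used in Themis, which this abstract does not specify. It assumes iterative proportional fitting (raking) as a stand-in technique, and the function name `rake`, the attributes `region` and `age`, and all counts are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def rake(sample, marginals, iterations=50):
    """Reweight sample rows so weighted counts match known population marginals.

    sample:    list of dicts, one per sampled tuple
    marginals: {attribute: {value: population_count}} (assumed known a priori)
    """
    weights = [1.0] * len(sample)
    for _ in range(iterations):
        for attr, targets in marginals.items():
            # Current weighted count for each value of this attribute.
            totals = defaultdict(float)
            for row, w in zip(sample, weights):
                totals[row[attr]] += w
            # Rescale every row so this attribute's weighted counts hit the targets.
            for i, row in enumerate(sample):
                weights[i] *= targets[row[attr]] / totals[row[attr]]
    return weights


# Hypothetical biased sample: urban residents are over-represented.
sample = [
    {"region": "urban", "age": "young"},
    {"region": "urban", "age": "young"},
    {"region": "urban", "age": "old"},
    {"region": "rural", "age": "old"},
]

# Hypothetical population aggregates (one-dimensional counts per attribute).
marginals = {
    "region": {"urban": 500, "rural": 500},
    "age": {"young": 400, "old": 600},
}

weights = rake(sample, marginals)


def weighted_count(predicate):
    """Approximate a population COUNT(*) query using the rebalanced weights."""
    return sum(w for row, w in zip(sample, weights) if predicate(row))


urban_estimate = weighted_count(lambda r: r["region"] == "urban")
young_estimate = weighted_count(lambda r: r["age"] == "young")
```

In this sketch, a query over the weighted sample approximates the same query over the population: the raw sample is 75% urban, while the weighted count for `region == "urban"` moves toward the supplied aggregate of 500 out of 1000. Raking can only adjust tuples that appear in the sample, so it illustrates bias correction but not the assignment of nonzero probability to unseen tuples.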