Scalable Query Evaluation over Complex Probabilistic Databases
MetadataShow full item record
The age of Big Data has brought with itself datasets which are not just big, but also much more complicated. These datasets are constructed from disparate, unreliable and noisy sources, many times in an ad-hoc way because careful data cleaning and integration is too time consuming and not always necessary anymore. Representing the uncertainty hidden in these datasets is necessary to get meaningful query answers and Probabilistic Databases have come up as arguably the most popular solution to this problem. Their application to practical problems though has been held back because (i) the common models they use are not rich enough to capture the dependencies in these problems, and (ii) unlike traditional databases, query evaluation for probabilistic databases can be very expensive and unpredictable. This dissertation addresses these challenges by first proposing a new model for probabilistic databases that is rich enough to capture the dependencies found in most practical applications, while still allowing for a translation to considerably simpler and well-studied models. Our model leverages existing models from AI literature that combine probability theory with logic. The main challenge of query evaluation over probabilistic databases is that it requires solving probabilistic inference which is a notoriously hard problem. This dissertation studies this problem via both (i) foundational results that give new theoretical insights about existing probabilistic inference algorithms, like Read-Once Formulas, Tree-Decompositions, Binary Decision Diagrams, Negation Normal Forms, when applied to the setting of probabilistic databases, which as we will see have their own distinct challenges and expectations, and (ii) building a robust system where the above ideas are leveraged for efficient and reliable query evaluation.