Query Processing for Massively Parallel Systems
The need to analyze and understand big data has changed the landscape of data management in recent years. To process the large amounts of data available to users in both industry and science, many modern data management systems leverage the power of massive parallelism. The challenge of scaling computation to thousands of processing units demands that we change our thinking on how we design such systems, and on how we analyze and design parallel algorithms. In this dissertation, I study the fundamental problem of query processing for modern massively parallel architectures. I propose a theoretical model, the MPC model (Massively Parallel Computation), to analyze the performance of parallel algorithms for query processing. In the MPC model, the data is initially evenly distributed among p servers. The computation proceeds in rounds: each round consists of some local computation followed by a global exchange of data between the servers. The computational complexity of an algorithm is characterized by both the number of rounds necessary and the maximum amount of data, or maximum load, that each server receives. The challenge is to identify the optimal tradeoff between the number of rounds and the maximum load for various computational tasks. As a first step towards understanding query processing in the MPC model, we study conjunctive queries (multiway joins) for a single round. We show that a particular type of distributed algorithm, the HyperCube algorithm, can optimally compute join queries when restricted to one communication round and data without skew. In most real-world applications, data has skew (for example, a graph with nodes of large degree) that causes an uneven distribution of the load, and thus reduces the effectiveness of parallelism. We show that the HyperCube algorithm is more resilient to skew than traditional parallel query plans.
To deal with any case of skew, we also design data-sensitive techniques that identify the outliers in the data and alleviate the effect of skew by further splitting the computation across more servers. In the case of multiple rounds, we present nearly optimal algorithms for conjunctive queries for the case of data without skew. A surprising consequence of our results is that they can be applied to analyze iterative computational tasks: we can prove that, in order to compute the connected components of a graph, any algorithm requires more than a constant number of communication rounds. Finally, we show a surprising connection between the MPC model and algorithms in the external memory model of computation.
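To make the one-round setting concrete, the following is a minimal single-process sketch of a HyperCube-style shuffle for the triangle query R(a,b), S(b,c), T(c,a). It is an illustration under stated assumptions, not the dissertation's implementation: the servers are simulated as dictionary entries arranged in a side × side × side grid, and the names `bucket` and `hypercube_triangles`, the salted SHA-256 hash, and the equal share per variable are all choices made here for the example.

```python
import hashlib
from collections import defaultdict


def bucket(value, buckets, salt):
    """Deterministically hash a value into range(buckets); salt separates variables."""
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def hypercube_triangles(R, S, T, side):
    """Simulate one communication round on side**3 servers.

    Returns (set of triangles (a, b, c), maximum number of tuples
    received by any single simulated server).
    """
    servers = defaultdict(lambda: {"R": set(), "S": set(), "T": set()})
    load = defaultdict(int)

    def send(coord, name, tup):
        servers[coord][name].add(tup)
        load[coord] += 1

    # Communication step: each tuple fixes two grid coordinates by hashing
    # its two attributes and is replicated along the remaining coordinate.
    for a, b in R:
        i, j = bucket(a, side, "a"), bucket(b, side, "b")
        for k in range(side):
            send((i, j, k), "R", (a, b))
    for b, c in S:
        j, k = bucket(b, side, "b"), bucket(c, side, "c")
        for i in range(side):
            send((i, j, k), "S", (b, c))
    for c, a in T:
        k, i = bucket(c, side, "c"), bucket(a, side, "a")
        for j in range(side):
            send((i, j, k), "T", (c, a))

    # Local computation step: every server joins only the fragments it holds.
    triangles = set()
    for frag in servers.values():
        s_by_b = defaultdict(list)
        for b, c in frag["S"]:
            s_by_b[b].append(c)
        for a, b in frag["R"]:
            for c in s_by_b.get(b, ()):
                if (c, a) in frag["T"]:
                    triangles.add((a, b, c))

    return triangles, max(load.values(), default=0)


if __name__ == "__main__":
    R = {(1, 2), (2, 3), (1, 3)}
    S = {(2, 3), (3, 1)}
    T = {(3, 1), (1, 2)}
    result, max_load = hypercube_triangles(R, S, T, side=2)
    print(sorted(result), max_load)
```

In this sketch every triangle (a, b, c) is found at the server whose coordinates are the hashes of a, b, and c, since that server receives all three contributing tuples. With p = side³ simulated servers, each tuple is replicated side times, so for three relations of m tuples each the average load per server is 3m / p^(2/3); how close the maximum load stays to that average depends on how skewed the data is.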