Query Processing for Massively Parallel Systems

dc.contributor.advisor: Suciu, Dan (en_US)
dc.contributor.author: Koutris, Paraschos (en_US)
dc.date.accessioned: 2015-09-29T18:00:50Z
dc.date.available: 2015-09-29T18:00:50Z
dc.date.issued: 2015-09-29
dc.date.submitted: 2015 (en_US)
dc.description: Thesis (Ph.D.)--University of Washington, 2015 (en_US)
dc.description.abstract: The need to analyze and understand big data has changed the landscape of data management in recent years. To process the large amounts of data available to users in both industry and science, many modern data management systems leverage the power of massive parallelism. The challenge of scaling computation to thousands of processing units demands that we change our thinking on how we design such systems, and on how we analyze and design parallel algorithms. In this dissertation, I study the fundamental problem of query processing for modern massively parallel architectures. I propose a theoretical model, the MPC (Massively Parallel Computation) model, to analyze the performance of parallel algorithms for query processing. In the MPC model, the data is initially distributed evenly among p servers. The computation proceeds in rounds: each round consists of some local computation followed by a global exchange of data between the servers. The computational complexity of an algorithm is characterized by both the number of rounds required and the maximum amount of data, or maximum load, that each server receives. The challenge is to identify the optimal tradeoff between the number of rounds and the maximum load for various computational tasks. As a first step towards understanding query processing in the MPC model, we study conjunctive queries (multiway joins) for a single round. We show that a particular type of distributed algorithm, the HyperCube algorithm, can optimally compute join queries when restricted to one communication round and to data without skew. In most real-world applications, data has skew (for example, a graph with nodes of large degree) that causes an uneven distribution of the load and thus reduces the effectiveness of parallelism. We show that the HyperCube algorithm is more resilient to skew than traditional parallel query plans. To deal with any case of skew, we also design data-sensitive techniques that identify the outliers in the data and alleviate the effect of skew by splitting the computation further across more servers. In the case of multiple rounds, we present nearly optimal algorithms for conjunctive queries on data without skew. A surprising consequence of our results is that they can be applied to analyze iterative computational tasks: we can prove that, in order to compute the connected components of a graph, any algorithm requires more than a constant number of communication rounds. Finally, we show a surprising connection between the MPC model and algorithms in the external memory model of computation. (en_US)
dc.embargo.terms: Open Access (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.other: Koutris_washington_0250E_14969.pdf (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/33697
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.subject: big data; databases; parallelism; query processing (en_US)
dc.subject.other: Computer science (en_US)
dc.subject.other: computer science and engineering (en_US)
dc.title: Query Processing for Massively Parallel Systems (en_US)
dc.type: Thesis (en_US)
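
Illustrative note on the abstract: the abstract names the HyperCube algorithm and the MPC cost measures (one communication round, maximum load per server) but does not spell out the algorithm. The sketch below is a minimal single-machine simulation of the commonly described form of HyperCube for the triangle query Q(x, y, z) = R(x, y), S(y, z), T(z, x). It is an assumption-laden illustration, not the dissertation's implementation: the function names, the equal shares per variable (a side x side x side server grid), the toy hash function, and the use of integer node ids are all hypothetical choices made here.

```python
from collections import defaultdict


def h(value, buckets, seed):
    # Toy hash (an assumption of this sketch): maps an integer value to one of
    # `buckets` coordinates; `seed` selects a different function per variable.
    return hash((seed, value)) % buckets


def hypercube_triangles(R, S, T, side):
    """Simulate one communication round on p = side**3 servers.

    Servers are points (i, j, k) of a grid; i is tied to x, j to y, k to z.
    Returns the set of triangles and the maximum number of tuples any
    server received (the maximum load).
    """
    servers = defaultdict(lambda: {"R": set(), "S": set(), "T": set()})

    # Communication step: each tuple fixes two coordinates and is
    # replicated along the coordinate of the variable it does not contain.
    for (x, y) in R:
        for k in range(side):
            servers[(h(x, side, 0), h(y, side, 1), k)]["R"].add((x, y))
    for (y, z) in S:
        for i in range(side):
            servers[(i, h(y, side, 1), h(z, side, 2))]["S"].add((y, z))
    for (z, x) in T:
        for j in range(side):
            servers[(h(x, side, 0), j, h(z, side, 2))]["T"].add((z, x))

    # Local computation step: every server joins only what it received.
    results = set()
    max_load = 0
    for frag in servers.values():
        load = len(frag["R"]) + len(frag["S"]) + len(frag["T"])
        max_load = max(max_load, load)
        for (x, y) in frag["R"]:
            for (y2, z) in frag["S"]:
                if y2 == y and (z, x) in frag["T"]:
                    results.add((x, y, z))
    return results, max_load


def naive_triangles(R, S, T):
    # Reference answer computed without any partitioning.
    return {(x, y, z) for (x, y) in R for (y2, z) in S
            if y2 == y and (z, x) in T}


if __name__ == "__main__":
    edges = {(1, 2), (2, 3), (3, 1), (3, 4)}
    found, load = hypercube_triangles(edges, edges, edges, side=2)
    print(sorted(found), "max load:", load)
```

Any triangle (x, y, z) is found at the server whose coordinates are the hashes of x, y and z, because all three of its tuples are routed there; this is why a single round suffices in the sketch. A heavily repeated value (skew) would concentrate tuples on few coordinates and raise the maximum load, which is the effect the abstract discusses.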

Files

Original bundle

Name: Koutris_washington_0250E_14969.pdf
Size: 731.5 KB
Format: Adobe Portable Document Format