Query Processing for Massively Parallel Systems

Koutris, Paraschos

dc.contributor.advisor	Suciu, Dan	en_US
dc.contributor.author	Koutris, Paraschos	en_US
dc.date.accessioned	2015-09-29T18:00:50Z
dc.date.available	2015-09-29T18:00:50Z
dc.date.issued	2015-09-29
dc.date.submitted	2015	en_US
dc.description	Thesis (Ph.D.)--University of Washington, 2015	en_US
dc.description.abstract	The need to analyze and understand big data has changed the landscape of data management in recent years. To process the large amounts of data available to users in both industry and science, many modern data management systems leverage the power of massive parallelism. The challenge of scaling computation to thousands of processing units demands that we change our thinking on how we design such systems, and on how we analyze and design parallel algorithms. In this dissertation, I study the fundamental problem of query processing for modern massively parallel architectures. I propose a theoretical model, the MPC model (Massively Parallel Computation), to analyze the performance of parallel algorithms for query processing. In the MPC model, the data is initially evenly distributed among p servers. The computation proceeds in rounds: each round consists of some local computation followed by a global exchange of data between the servers. The computational complexity of an algorithm is characterized by both the number of rounds necessary and the maximum amount of data, or maximum load, that each processor receives. The challenge is to identify the optimal tradeoff between the number of rounds and maximum load for various computational tasks. As a first step towards understanding query processing in the MPC model, we study conjunctive queries (multiway joins) for a single round. We show that a particular type of distributed algorithm, the HyperCube algorithm, can optimally compute join queries when restricted to one communication round and data without skew. In most real-world applications, data has skew (for example, a graph with nodes of large degree) that causes an uneven distribution of the load, and thus reduces the effectiveness of parallelism. We show that the HyperCube algorithm is more resilient to skew than traditional parallel query plans.
To deal with any case of skew, we also design data-sensitive techniques that identify the outliers in the data and alleviate the effect of skew by further splitting the computation across more servers. In the case of multiple rounds, we present nearly optimal algorithms for conjunctive queries for the case of data without skew. A surprising consequence of our results is that they can be applied to analyze iterative computational tasks: we can prove that, in order to compute the connected components of a graph, any algorithm requires more than a constant number of communication rounds. Finally, we show a surprising connection between the MPC model and algorithms in the external memory model of computation.	en_US
dc.embargo.terms	Open Access	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.other	Koutris_washington_0250E_14969.pdf	en_US
dc.identifier.uri	http://hdl.handle.net/1773/33697
dc.language.iso	en_US	en_US
dc.rights	Copyright is held by the individual authors.	en_US
dc.subject	big data; databases; parallelism; query processing	en_US
dc.subject.other	Computer science	en_US
dc.subject.other	computer science and engineering	en_US
dc.title	Query Processing for Massively Parallel Systems	en_US
dc.type	Thesis	en_US
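
The one-round HyperCube algorithm named in the abstract can be illustrated on the triangle query R(a,b), S(b,c), T(c,a). What follows is a minimal Python sketch, not code from the dissertation. It simulates p = side^3 servers arranged in a cube; the function name `hypercube_triangles`, the equal share `side` for each variable, and the modulo hash are all assumptions made for illustration. Each tuple is sent to every server that agrees with the hashes of its two attributes, each server joins what it received, and the largest number of tuples received by any server is reported as the maximum load.

```python
from collections import defaultdict


def hypercube_triangles(R, S, T, side):
    """Simulate a one-round HyperCube join for the triangle query.

    R holds (a, b) pairs, S holds (b, c) pairs, T holds (c, a) pairs.
    Servers are the points (x_a, x_b, x_c) of a side x side x side cube.
    Returns (set of (a, b, c) answers, maximum load over all servers).
    """
    def h(value):
        # Illustrative hash only (an assumption, not from the thesis).
        return hash(value) % side

    servers = defaultdict(lambda: {"R": set(), "S": set(), "T": set()})

    # Communication step: a tuple fixes two coordinates and is
    # replicated along the third one.
    for a, b in R:
        for k in range(side):
            servers[(h(a), h(b), k)]["R"].add((a, b))
    for b, c in S:
        for i in range(side):
            servers[(i, h(b), h(c))]["S"].add((b, c))
    for c, a in T:
        for j in range(side):
            servers[(h(a), j, h(c))]["T"].add((c, a))

    # Local computation step: every server joins its own fragments.
    triangles = set()
    max_load = 0
    for fragment in servers.values():
        load = sum(len(tuples) for tuples in fragment.values())
        max_load = max(max_load, load)
        s_by_b = defaultdict(list)
        for b, c in fragment["S"]:
            s_by_b[b].append(c)
        for a, b in fragment["R"]:
            for c in s_by_b.get(b, ()):
                if (c, a) in fragment["T"]:
                    triangles.add((a, b, c))
    return triangles, max_load


if __name__ == "__main__":
    edges = {(1, 2), (2, 3), (3, 1), (1, 4)}
    answers, load = hypercube_triangles(edges, edges, edges, side=2)
    print(sorted(answers), load)
```

With `side=1` every tuple lands on a single server, so the maximum load equals the total input size. Larger values of `side` spread the tuples over more servers at the cost of replicating each tuple `side` times, which is the rounds-versus-load tradeoff the abstract describes, restricted here to one round.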

Files

Name: Koutris_washington_0250E_14969.pdf
Size: 731.5 KB
Format: Adobe Portable Document Format

Collections

Computer science and engineering