High-performance parallel systems for data-intensive computing
Myers, Brandon David
Applications in data science rely on two computing paradigms: tuned high-performance parallel programs and data analytics. Historically, their differences were good reason to separate the paradigms into different systems, but recent changes in hardware, and the fast data processing techniques that followed from them, call this separation into question. The goal of this dissertation is to present systems and experiments that combine high-performance parallel programs and data analytics for performance while preserving programmability.
First, I present Grappa, a distributed parallel programming language implementation designed for building high-performance data-intensive systems with less effort. Grappa provides a simple fine-grained programming model, while using the parallelism inherent in data-intensive applications to execute the program efficiently. Using Grappa, we built native applications and domain-specific frameworks for dataflow processing, graph processing, and relational query processing that are faster than their domain-specific counterparts.
Then, I present a survey of techniques for fast in-memory query evaluation. Drawing on existing literature, I classify the overheads of conventional query evaluation techniques and the techniques that address them. In particular, I focus on query compilation, which specializes the query processor to the particular query. These ideas inspire the design of the second system.
Finally, I present Radish, a query processing engine built upon Grappa. Radish uses efficient distributed data structures, avoids extra messages, and uses Grappa’s runtime to execute fine-grained, tuple-by-tuple evaluation efficiently. I also developed a new query compilation technique that generates parallel code for entire processing pipelines. This compilation technique increased performance by 2.4× compared to generating fragments of code. Radish is also competitive with other distributed parallel data processing systems.
In this dissertation, I provide supporting evidence for my thesis statement: When applied to data-intensive applications, high-performance parallel systems and new database query evaluation techniques support improved performance, programmer productivity, and closer interaction between handwritten parallel programs and declarative queries.