Managing Skew in the Parallel Evaluation of User-Defined Operations

KWON, YONGCHUL

dc.contributor.advisor	BALAZINSKA, MAGDALENA	en_US
dc.contributor.author	KWON, YONGCHUL	en_US
dc.date.accessioned	2013-02-25T18:01:29Z
dc.date.available	2013-02-25T18:01:29Z
dc.date.issued	2013-02-25
dc.date.submitted	2012	en_US
dc.description	Thesis (Ph.D.)--University of Washington, 2012	en_US
dc.description.abstract	Science and business are generating data at an unprecedented scale and rate due to ever-evolving technologies in computing and sensors. Analyzing big data has become a key skill driving business and science. The challenges in big-data analysis stem not only from the data volume, but also from the diversity of data types to analyze (e.g., text, image, audio, video, and graph) and the various analyses beyond relational algebra that need to be performed (e.g., machine learning, natural language processing, image processing, and graph analysis). The user-defined operation (UDO) is a powerful mechanism to implement complex data processing tasks without changing the core of the parallel data processing engine. Although users can rapidly develop a new data analysis task with UDOs and execute the task in a cluster of computers, achieving high performance is important for users, especially those who do not have an extensive background in programming. This thesis focuses on addressing skew in parallel UDO evaluation. Skew is a problem when there exists a significant variance in the execution time of parallel tasks. In the presence of skew, the benefit of using a parallel system diminishes. Our detailed case study demonstrates that a new data analysis task can be rapidly implemented in a MapReduce-like system, but such an implementation may be prone to skew problems during execution. A skew-resilient implementation is possible but requires significant implementation effort and expertise in programming. We also analyze the skew problem in three real workloads and show that the skew problem is frequent (more than 40% of long-running jobs experience skew). The thesis proposes two techniques to manage skew in parallel UDO evaluations: SkewReduce and SkewTune. SkewReduce is a static data partition optimization technique for feature-extracting applications that are common in scientific analysis. SkewReduce can improve the application runtime by up to 8x compared with a default MapReduce data partitioning strategy without any code-level optimization. SkewTune is a transparent dynamic skew mitigation technique for MapReduce applications. SkewTune can improve the application runtime by up to 4x compared with the default MapReduce engine without modifying the application source code, without requiring any input from the developer or user, and without causing any side effects during the execution.	en_US
dc.embargo.terms	No embargo	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.other	KWON_washington_0250E_10987.pdf	en_US
dc.identifier.uri	http://hdl.handle.net/1773/22013
dc.language.iso	en_US	en_US
dc.rights	Copyright is held by the individual authors.	en_US
dc.subject	database; load balancing; MapReduce; parallel database; skew; udo	en_US
dc.subject.other	Computer science	en_US
dc.subject.other	Computer science and engineering	en_US
dc.title	Managing Skew in the Parallel Evaluation of User-Defined Operations	en_US
dc.type	Thesis	en_US

Files

Original bundle

Name: KWON_washington_0250E_10987.pdf
Size: 4.15 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering