Managing Skew in the Parallel Evaluation of User-Defined Operations

dc.contributor.advisor: BALAZINSKA, MAGDALENA (en_US)
dc.contributor.author: KWON, YONGCHUL (en_US)
dc.date.accessioned: 2013-02-25T18:01:29Z
dc.date.available: 2013-02-25T18:01:29Z
dc.date.issued: 2013-02-25
dc.date.submitted: 2012 (en_US)
dc.description: Thesis (Ph.D.)--University of Washington, 2012 (en_US)
dc.description.abstract: Science and business are generating data at an unprecedented scale and rate due to ever-evolving technologies in computing and sensors. Analyzing big data has become a key skill driving business and science. The challenges in big-data analysis stem not only from the data volume, but also from the diversity of data types to analyze (e.g., text, image, audio, video, and graph) and the various analyses beyond relational algebra that need to be performed (e.g., machine learning, natural language processing, image processing, and graph analysis). The user-defined operation (UDO) is a powerful mechanism to implement complex data processing tasks without changing the core of the parallel data processing engine. Although users can rapidly develop a new data analysis task with UDOs and execute the task in a cluster of computers, achieving high performance is important for users, especially those who do not have an extensive background in programming. This thesis focuses on addressing skew in parallel UDO evaluation. Skew becomes a problem when the execution times of parallel tasks vary significantly. In the presence of skew, the benefit of using a parallel system diminishes. Our detailed case study demonstrates that a new data analysis task can be rapidly implemented in a MapReduce-like system, but such an implementation may be prone to skew during execution. A skew-resilient implementation is possible but requires significant implementation effort and expertise in programming. We also analyze skew in three real workloads and show that it is frequent (more than 40% of long-running jobs experience skew). The thesis proposes two techniques to manage skew in parallel UDO evaluations: SkewReduce and SkewTune. SkewReduce is a static data partition optimization technique for feature-extracting applications that are common in scientific analysis. SkewReduce can improve the application runtime by up to 8x compared with a default MapReduce data partitioning strategy without any code-level optimization. SkewTune is a transparent dynamic skew mitigation technique for MapReduce applications. SkewTune can improve the application runtime by up to 4x compared with the default MapReduce engine without modifying the application source code, without requiring any input from the developer or user, and without causing any side effects during the execution. (en_US)
dc.embargo.terms: No embargo (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.other: KWON_washington_0250E_10987.pdf (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/22013
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.subject: database; load balancing; MapReduce; parallel database; skew; udo (en_US)
dc.subject.other: Computer science (en_US)
dc.subject.other: Computer science and engineering (en_US)
dc.title: Managing Skew in the Parallel Evaluation of User-Defined Operations (en_US)
dc.type: Thesis (en_US)

Files

Original bundle

Name: KWON_washington_0250E_10987.pdf
Size: 4.15 MB
Format: Adobe Portable Document Format