Managing Skew in the Parallel Evaluation of User-Defined Operations
Science and business are generating data at an unprecedented scale and rate due to ever-evolving technologies in computing and sensors. Analyzing big data has become a key skill driving business and science. The challenges in big-data analysis stem not only from the data volume, but also from the diversity of data types to analyze (e.g., text, image, audio, video, and graph) and the variety of analyses beyond relational algebra that need to be performed (e.g., machine learning, natural language processing, image processing, and graph analysis). The user-defined operation (UDO) is a powerful mechanism for implementing complex data processing tasks without changing the core of the parallel data processing engine. Although users can rapidly develop a new data analysis task with UDOs and execute it on a cluster of computers, achieving high performance is important for users, especially those who do not have an extensive background in programming. This thesis focuses on addressing skew in parallel UDO evaluation. Skew arises when there is significant variance in the execution times of parallel tasks. In the presence of skew, the benefit of using a parallel system diminishes. Our detailed case study demonstrates that a new data analysis task can be rapidly implemented in a MapReduce-like system, but such an implementation may be prone to skew during execution. A skew-resilient implementation is possible but requires significant implementation effort and programming expertise. We also analyze the skew problem in three real workloads and show that skew is frequent (more than 40% of long-running jobs experience skew). The thesis proposes two techniques to manage skew in parallel UDO evaluation: SkewReduce and SkewTune. SkewReduce is a static data partition optimization technique for feature-extracting applications that are common in scientific analysis.
SkewReduce can improve the application runtime by up to 8x compared with the default MapReduce data partitioning strategy, without any code-level optimization. SkewTune is a transparent dynamic skew mitigation technique for MapReduce applications. SkewTune can improve the application runtime by up to 4x compared with the default MapReduce engine, without modifying the application source code, without requiring any input from the developer or user, and without causing any side effects during execution.
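To illustrate why skew diminishes the benefit of parallelism, the following minimal sketch compares two hypothetical sets of task times with the same total work. It assumes one task per worker, so the job finishes when its slowest task does. The function names, the task times, and the max-to-mean "skew ratio" are illustrative assumptions and are not taken from the thesis. The sketch does not implement SkewReduce or SkewTune.

```python
# Illustrative sketch only: names and numbers below are hypothetical.

def job_runtime(task_times):
    """A parallel phase ends when its slowest task ends (one task per worker)."""
    return max(task_times)

def speedup(task_times):
    """Speedup over running all tasks one after another on a single worker."""
    return sum(task_times) / job_runtime(task_times)

def skew_ratio(task_times):
    """Slowest task time divided by mean task time; 1.0 means perfectly balanced."""
    mean = sum(task_times) / len(task_times)
    return max(task_times) / mean

# Two hypothetical jobs with the same total work (40 time units) on 4 workers.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [2.0, 2.0, 2.0, 34.0]

for name, times in (("balanced", balanced), ("skewed", skewed)):
    print(name,
          "runtime:", job_runtime(times),
          "speedup:", round(speedup(times), 2),
          "skew ratio:", round(skew_ratio(times), 2))
```

In the balanced case the four workers finish together and the speedup equals the number of workers. In the skewed case three workers sit idle while one task runs, so the speedup falls close to 1 even though the total work is unchanged.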