Performance-Based Service Level Agreements for Data Analytics in the Cloud
A variety of data analytics systems are available as cloud services today, such as Amazon Elastic MapReduce (EMR) and Azure Data Lake Analytics. To buy these services, users select and pay for a given cluster configuration based on the number and type of service instances. Today's cloud service pricing models thus force users to translate their data management needs into resource needs. It is well known, however, that users have difficulty selecting a configuration that meets their needs. For non-experts, configuration decisions are even harder, especially when they seek to explore a new dataset. This thesis focuses on the challenges and implementation details of building a system that helps bridge the gap between the data analytics services users need and the way cloud providers offer them. The first challenge in closing the gap is finding a new type of abstraction that simplifies user interactions with cloud services. We introduce the notion of a "Personalized Service Level Agreement" (PSLA) and the PSLAManager system that implements it. Instead of asking users to specify the exact resources they think they need or the exact queries that must be executed, PSLAManager shows them service options for a set price. Second, providing PSLAs is challenging for service providers, who seek to avoid both paying for SLA violations and over-provisioning their resources. To address these challenges, we present SLAOrchestrator, a system that supports performance-centric (rather than availability-centric) SLAs for data analytics services. SLAOrchestrator uses PSLAManager to generate SLAs; to reduce resource and SLA violation costs, it introduces a new sub-system called PerfEnforce that uses scaling algorithms and provisioning techniques to minimize the financial penalties to providers.
Finally, SLAOrchestrator's models rely on, among other inputs, the cost estimates from the query optimizer of the data analytics service. With more complex workloads, these estimates can be inaccurate due to the cardinality estimation problem. In the DeepQuery project, we empirically evaluate how deep learning can improve the accuracy of cardinality estimation.