Optimizing Distributed Systems using Machine Learning

Krishnamurthy, Arvind
Cano, Ignacio Agustin
2019-05-02
Cano_washington_0250E_19622.pdf
http://hdl.handle.net/1773/43659
Thesis (Ph.D.)--University of Washington, 2019

Distributed systems consist of many components that interact with each other to perform one or more tasks. Traditionally, many of these systems base their decisions on rules or configurations defined by operators, as well as on handcrafted analytical models. However, creating those rules or engineering such models is challenging. First, the same system should work under a combinatorial number of conditions on top of heterogeneous hardware. Second, it should support different types of workloads and run in potentially widely different settings. Third, it should handle time-varying resource needs. Together, these factors make reasoning about the performance of distributed systems far from trivial.

In this thesis, we propose optimizing distributed systems using machine learning (ML). Our main contribution is the design, implementation, augmentation, and evaluation of three distributed systems that illustrate the impact of these ML-based optimizations: 1) Curator, a framework that safeguards the health and performance of distributed storage systems by scheduling and executing background maintenance tasks; 2) AdaRes, an adaptive system that dynamically adjusts virtual machine resources in virtual execution environments; and 3) Pulpo, a federated system that efficiently trains machine learning models across different data centers. Each system instantiates ML models appropriate for the task at hand, relieving system designers of manually tuning rules and handcrafting complex analytical models. Our evaluations on real clusters show that our ML formulations improve the efficiency and performance of distributed systems.

application/pdf
en-US
contextual bandits; distributed systems; machine learning; optimization; reinforcement learning
Computer science
Computer science and engineering
Thesis