Failure Diagnosis for Datacenter Applications

dc.contributor.advisor: Anderson, Thomas E.
dc.contributor.advisor: Krishnamurthy, Arvind
dc.contributor.author: Zhang, Qiao
dc.date.accessioned: 2018-07-31T21:11:03Z
dc.date.available: 2018-07-31T21:11:03Z
dc.date.issued: 2018-07-31
dc.date.submitted: 2018
dc.description: Thesis (Ph.D.)--University of Washington, 2018
dc.description.abstract: Fast and accurate failure diagnosis remains a major challenge for datacenter operators. Current datacenter applications are increasingly architected around loosely-coupled modular components: each component can scale and evolve independently. However, when application failures occur, they become much harder to detect and localize. The challenges are three-fold: complex component dependency, gray failures, and unpredictable component behaviors. My thesis is that fast and accurate failure diagnosis for datacenter applications is possible using three key ideas: (1) a global view of component interactions and dependencies, (2) a penalized-regression-based failure localization algorithm that localizes both fail-stop and gray failures, and (3) a network architecture that produces predictable routes, simplifying failure localization without sacrificing load balancing and other network features. I present two complementary systems to demonstrate this. The first, Deepview, is a system that can localize virtual hard disk (VHD) failures in Infrastructure-as-a-Service clouds. I show that Deepview localizes VHD failures accurately and quickly to compute, storage and network components in production at Microsoft Azure. The second, Volur, is a network architecture that makes in-network routing predictable to the end-hosts. I show that Volur accurately localizes non-fail-stop link or switch failures and approximates state-of-the-art dynamic load balancing schemes.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Zhang_washington_0250E_18482.pdf
dc.identifier.uri: http://hdl.handle.net/1773/42264
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: cloud computing
dc.subject: datacenter applications
dc.subject: datacenter networks
dc.subject: distributed systems
dc.subject: failure diagnosis
dc.subject: failure localization
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Failure Diagnosis for Datacenter Applications
dc.type: Thesis

Files

Original bundle

Name: Zhang_washington_0250E_18482.pdf
Size: 1.29 MB
Format: Adobe Portable Document Format