Delivering Predictable Tail Latency in Data Center Networks

dc.contributor.advisor: Anderson, Thomas E
dc.contributor.author: Zhao, Kevin
dc.date.accessioned: 2026-02-05T19:34:19Z
dc.date.available: 2026-02-05T19:34:19Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Modern web services decompose a user request into thousands of RPCs whose slowest 1% dominate end-to-end latency, costing revenue and straining user patience. Operators codify expectations as tail latency SLOs, but meeting them is difficult even in well-run data center networks. Although such networks expose configuration parameters that have a large impact on tail latency, like switch weights, congestion windows, and switch marking thresholds, operators typically set these parameters once and rarely revisit them. When workload characteristics shift, for example in burstiness, traffic mix, or demand patterns, the resulting mismatch between the workload and the network can degrade user-observed performance and cause SLO violations, even in networks that deploy congestion control, traffic engineering, and class-based scheduling. A natural response is to adapt network parameters when workloads change, but existing methods adjust parameters by trial and error, risking intermediate violations and slow convergence in high-dimensional, noisy settings. This dissertation argues that prediction-guided control is an effective technique for delivering predictable tail latency in data center networks. It makes two contributions. First, Parsimon is a scalable tail-latency estimator. Through a series of approximations, Parsimon decouples links and simulates them in parallel, allowing it to run orders of magnitude faster than full-fidelity simulators while retaining distribution-level accuracy. Second, Polyphony embeds such estimators in a closed-loop control system to improve network performance. It treats predictions as priors, fuses them with live measurements, and searches safely inside a trust region that resets as conditions drift. In a small testbed on real machines, Polyphony meets tail latency SLOs within minutes, whereas a state-of-the-art model-free tuner fails to converge after an hour. Together, fast prediction and prediction-guided control form a promising toolkit for steering large networks toward better performance for latency-sensitive applications, reducing the cost of provisioning and the risk of unsafe exploration.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Zhao_washington_0250E_29042.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55190
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Data center networks
dc.subject: Network simulation
dc.subject: Prediction-guided control
dc.subject: Service-level objectives
dc.subject: Tail latency
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Delivering Predictable Tail Latency in Data Center Networks
dc.type: Thesis

Files

Original bundle

Name: Zhao_washington_0250E_29042.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format