Delivering Predictable Tail Latency in Data Center Networks

Zhao, Kevin

dc.contributor.advisor	Anderson, Thomas E
dc.contributor.author	Zhao, Kevin
dc.date.accessioned	2026-02-05T19:34:19Z
dc.date.available	2026-02-05T19:34:19Z
dc.date.issued	2026-02-05
dc.date.submitted	2025
dc.description	Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract	Modern web services decompose a user request into thousands of RPCs whose slowest 1% dominate end-to-end latency, costing revenue and straining user patience. Operators codify expectations as tail latency SLOs, but meeting them is difficult even in well-run data center networks. Although such networks expose configuration parameters that have a large impact on tail latency, like switch weights, congestion windows, and switch marking thresholds, operators typically set these parameters once and rarely revisit them. When workload characteristics shift, for example in burstiness, traffic mix, or demand patterns, the resulting mismatch between the workload and the network can degrade user-observed performance and cause SLO violations, even in networks that deploy congestion control, traffic engineering, and class-based scheduling. A natural response is to adapt network parameters when workloads change, but existing methods adjust parameters by trial and error, risking intermediate violations and slow convergence in high-dimensional, noisy settings. This dissertation argues that prediction-guided control is an effective technique for delivering predictable tail latency in data center networks. It makes two contributions. First, Parsimon is a scalable tail-latency estimator. Through a series of approximations, Parsimon decouples links and simulates them in parallel, allowing it to run orders of magnitude faster than full-fidelity simulators while retaining distribution-level accuracy. Second, Polyphony embeds such estimators in a closed-loop control system to improve network performance. It treats predictions as priors, fuses them with live measurements, and searches safely inside a trust region that resets as conditions drift. In a small testbed on real machines, Polyphony meets tail latency SLOs within minutes, whereas a state-of-the-art model-free tuner fails to converge after an hour. Together, fast prediction and prediction-guided control form a promising toolkit for steering large networks toward better performance for latency-sensitive applications, reducing the cost of provisioning and the risk of unsafe exploration.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Zhao_washington_0250E_29042.pdf
dc.identifier.uri	https://hdl.handle.net/1773/55190
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Data center networks
dc.subject	Network simulation
dc.subject	Prediction-guided control
dc.subject	Service-level objectives
dc.subject	Tail latency
dc.subject	Computer science
dc.subject.other	Computer science and engineering
dc.title	Delivering Predictable Tail Latency in Data Center Networks
dc.type	Thesis

Files

Original bundle

Name: Zhao_washington_0250E_29042.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering