Delivering Predictable Tail Latency in Data Center Networks
| dc.contributor.advisor | Anderson, Thomas E | |
| dc.contributor.author | Zhao, Kevin | |
| dc.date.accessioned | 2026-02-05T19:34:19Z | |
| dc.date.available | 2026-02-05T19:34:19Z | |
| dc.date.issued | 2026-02-05 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2025 | |
| dc.description.abstract | Modern web services decompose a user request into thousands of RPCs whose slowest 1% dominate end-to-end latency, costing revenue and straining user patience. Operators codify expectations as tail latency SLOs, but meeting them is difficult even in well-run data center networks. Although such networks expose configuration parameters that have a large impact on tail latency, like switch weights, congestion windows, and switch marking thresholds, operators typically set these parameters once and rarely revisit them. When workload characteristics shift, for example in burstiness, traffic mix, or demand patterns, the resulting mismatch between the workload and the network can degrade user-observed performance and cause SLO violations, even in networks that deploy congestion control, traffic engineering, and class-based scheduling. A natural response is to adapt network parameters when workloads change, but existing methods adjust parameters by trial and error, risking intermediate violations and slow convergence in high-dimensional, noisy settings. This dissertation argues that prediction-guided control is an effective technique for delivering predictable tail latency in data center networks. It makes two contributions. First, Parsimon is a scalable tail-latency estimator. Through a series of approximations, Parsimon decouples links and simulates them in parallel, allowing it to run orders of magnitude faster than full-fidelity simulators while retaining distribution-level accuracy. Second, Polyphony embeds such estimators in a closed-loop control system to improve network performance. It treats predictions as priors, fuses them with live measurements, and searches safely inside a trust region that resets as conditions drift. In a small testbed on real machines, Polyphony meets tail latency SLOs within minutes, whereas a state-of-the-art model-free tuner fails to converge after an hour. Together, fast prediction and prediction-guided control form a promising toolkit for steering large networks toward better performance for latency-sensitive applications, reducing the cost of provisioning and the risk of unsafe exploration. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Zhao_washington_0250E_29042.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/55190 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY | |
| dc.subject | Data center networks | |
| dc.subject | Network simulation | |
| dc.subject | Prediction-guided control | |
| dc.subject | Service-level objectives | |
| dc.subject | Tail latency | |
| dc.subject | Computer science | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Delivering Predictable Tail Latency in Data Center Networks | |
| dc.type | Thesis |
Files
Original bundle
- Name: Zhao_washington_0250E_29042.pdf
- Size: 1.56 MB
- Format: Adobe Portable Document Format
