Improving Fault Tolerance and Performance of Data Center Networks
Data center networks are a key component of the explosive growth of cloud computing, enabling the utilization of tens to hundreds of thousands of co-located servers for large-scale computing and services. As applications and data sets continue to grow rapidly, the challenge for data center networks is to keep pace by providing enough bandwidth while also lowering costs, increasing flexibility, and maintaining reliability. My thesis is that a key part of the answer is the network's wiring topology: topology has foundational cross-layer effects, and a small amount of intentional asymmetry in the topology can help data center networks meet that challenge. I present two complementary innovations that demonstrate this. The first, F10, is a co-design of the network topology and failover protocols to provide efficient, near-instantaneous, fine-grained, and localized recovery and rebalancing for common-case network failures. My results show that following network link and switch failures, F10 has one-seventh the packet loss of current schemes. The second innovation, Subways, proposes and evaluates a new method to add network capacity by connecting multiple network links per server in an overlapping topology. Using a simulation-based methodology, my work shows that Subways offers substantial performance benefits for popular application workloads: up to a 3.1x speedup in MapReduce and a 2.5x throughput improvement in memcache for a fixed average request latency, relative to an equivalent-bandwidth network that differs only in its wiring.