Practical, Efficient, and Reliable Data Center Communication
Data center communication is a key aspect of cloud computing, as it interconnects all the data center resources. It facilitates resource sharing among servers and has become critical for constructing distributed systems that power today’s popular cloud applications, such as databases, transaction systems, and video streaming services. There are three primary requirements when designing and implementing data center communication: reliability, efficiency, and virtualization. Reliability is important because companies depend on the cloud to run their critical business functions, so even brief downtime can result in significant lost productivity and revenue. Efficiency means cloud providers can deliver services with fewer resources and thus lower costs for customers. Virtualization allows cloud customers to move unmodified applications to the cloud, enabling more customers to benefit from the reliability and efficiency of the cloud, while also providing resource multiplexing: multiple customers can use the cloud simultaneously with strong security isolation. One major trend in data center communication is the rapid increase in bandwidth. To provide high network bandwidth, cloud providers build large-scale optical data center networks and use high-speed network interface cards to connect to servers. This trend poses several challenges in providing a practical, efficient, and reliable data center communication system. Data center networks using optical communication technologies are expensive to build and difficult to debug due to gray failures and over-engineering. Providing virtualization support at high speeds incurs high processing overheads due to additional packet handling in operating systems. This thesis offers new techniques to achieve greater reliability and efficiency for data center communication.
Our contribution is the design, implementation, and evaluation of three systems: (1) CorrOpt, a data center network monitoring and failure mitigation system that reduces packet corruption errors by three to six orders of magnitude; (2) RAIL, a network architecture that reduces the total cost of ownership of the data center network by up to 44%; and (3) Slim, an operating system kernel design that reduces the processing overheads of container network virtualization by up to 66%.