Towards More Efficient Communication for Distributed Learning Systems

Authors

Luo, Liang

Abstract

The explosion of data volume and the ever-increasing speed of accelerators shift the bottleneck of large-scale distributed training tasks from computation to communication. We observe significant pressure on the communication backends of various mainstream learning systems in multiple environments when running such tasks. Achieving efficient large-scale learning relies on more effective communication planes.

We provide detailed analysis that root-causes the bottlenecks affecting the communication efficiency of these systems in the context of different environments. We pinpoint such bottlenecks in the software, hardware, and network infrastructure stacks.

We show how these obstacles can be overcome with a systematic codesign of a streamlined communication stack, a balanced hardware and cluster configuration matched to the distributed training workload, together with awareness of network topology and environment. We show how this series of approaches, named Parameter Box, Parameter Hub, and Parameter Link, along with Cloud Collectives, accelerates distributed training from small clusters to datacenters and all the way to commercial clouds, while providing varying degrees of customization to suit different needs.

Description

Thesis (Ph.D.)--University of Washington, 2020
