Towards More Efficient Communication for Distributed Learning Systems

Authors

Luo, Liang

Abstract

The explosion of data volume and the ever-increasing speed of accelerators shift the bottleneck of large-scale distributed training tasks from computation to communication. We observe significant pressure on the communication backends of various mainstream learning systems in multiple environments when running such tasks. Achieving efficient large-scale learning relies on more effective communication planes.

We provide detailed analysis that root-causes the bottlenecks affecting the communication efficiency of these systems in the context of different environments. We pinpoint such bottlenecks in the software, hardware, and network infrastructure stacks.

We show how these obstacles can be overcome with a systematic codesign of a streamlined communication stack, a balanced hardware and cluster configuration matched to the distributed training workload, together with awareness of network topology and environment. We show how this series of approaches, named Parameter Box, Parameter Hub, and Parameter Link, along with Cloud Collectives, accelerates distributed training from small clusters to datacenters and all the way to commercial clouds, while providing varying degrees of customization to suit different needs.

Description

Thesis (Ph.D.)--University of Washington, 2020
