Accelerating Collective Communication for Distributed Machine Learning


Abstract

Collective communication has emerged as a cornerstone of distributed machine learning, enabling datacenter-scale clusters of accelerators to collaboratively train or serve large language models. However, it has also become a significant performance bottleneck, impeding the efficient utilization and scalability of hardware resources. This dissertation focuses on optimizing collective communication for machine learning hardware and workloads, approaching the challenge from the perspectives of network topology, communication scheduling, and parallelization strategies. We first present our work on co-optimizing network topology and communication scheduling for direct-connect optical circuit networks. We propose expansion techniques and a linear programming-based schedule generation algorithm to synthesize efficient large-scale topologies and schedules, thereby forming a Pareto frontier of the latency-throughput trade-off. Our approach enables efficient collective communication on low-diameter topologies. We then introduce ForestColl, a schedule generation algorithm that produces throughput-optimal schedules for any network topology in polynomial time. ForestColl leverages prior graph-theoretical results to construct spanning trees for collective communication. It is the first work to achieve throughput optimality for collective communication while delivering orders-of-magnitude speedups in schedule generation compared to prior approaches. Finally, we outline our future work on automating the search for parallelization and optimization strategies in machine learning training. We propose a strategy grounded in the sharding and processing states of tensors within the compiled computation graph. By adopting this unified view of all tensor types, our method can discover optimal parallelization and optimization strategies by determining the abstract states of tensors.
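To make the latency-throughput trade-off mentioned above concrete, the sketch below compares two textbook allreduce algorithms under the standard alpha-beta cost model. The formulas and parameter values are illustrative conventions, not the dissertation's algorithms or measurements: a ring allreduce is bandwidth-efficient but pays a per-step latency proportional to the node count, while recursive doubling finishes in logarithmically many steps but sends the full message each round.

```python
# Hedged sketch: alpha-beta cost model for allreduce over p nodes and an
# n-byte message. alpha = per-message latency (s), beta = per-byte time (s/B).
# These are standard textbook estimates used only to illustrate the
# latency-throughput trade-off; they are not taken from the dissertation.
import math

def ring_allreduce_cost(p: int, n: float, alpha: float, beta: float) -> float:
    # Ring allreduce: 2(p-1) steps (reduce-scatter + allgather),
    # moving 2n(p-1)/p bytes per node in total -> bandwidth-optimal,
    # but latency grows linearly with p.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n * beta

def recursive_doubling_cost(p: int, n: float, alpha: float, beta: float) -> float:
    # Recursive doubling: log2(p) rounds, each exchanging the full n bytes
    # -> latency-optimal, but bandwidth cost grows with log2(p).
    steps = math.log2(p)
    return steps * alpha + steps * n * beta

if __name__ == "__main__":
    p, alpha, beta = 16, 5e-6, 1e-9  # 16 nodes, 5 us latency, ~1 GB/s links
    for n in (1e3, 1e6, 1e9):
        ring = ring_allreduce_cost(p, n, alpha, beta)
        rd = recursive_doubling_cost(p, n, alpha, beta)
        print(f"n={n:10.0e} B  ring={ring:.6f}s  recursive-doubling={rd:.6f}s")
```

Running the model shows recursive doubling winning for small messages (latency-dominated) and the ring winning for large ones (bandwidth-dominated); no single schedule dominates at both extremes, which is why a Pareto frontier of topology-schedule pairs is the natural optimization target.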

Description

Thesis (Ph.D.)--University of Washington, 2026
