On the use and performance of communication primitives in software controlled cache-coherent cluster architectures
Two recent trends are affecting the design of medium-scale shared-memory multiprocessors. The first is the use of nodes that themselves consist of clusters of processors. Clusters, already available as commodity parts, not only make powerful nodes but also let the system scale up gracefully. The second trend is the use of programmable protocol processors and software to maintain cache coherence, which shortens the hardware design cycle and provides flexibility and extensibility.
One problem arising from software cache coherence is that remote memory accesses suffer a longer latency than with a pure hardware scheme. Another issue raised by software schemes in cluster environments is contention on the protocol processor, caused by the high service demand placed on this device.
Our solution to the first problem offers users or compiler writers a set of explicit communication primitives that provide hints for moving data properly and promptly. The communication primitives, running on the protocol processors, introduce a flavor of message passing and permit protocol optimization. For the second issue, we investigate three architectural choices that strive to achieve resource balance: (1) selecting an appropriate cluster size to control resource sharing, (2) adding a remote cache (per node) to keep remote data in clusters, and (3) adding forwarding logic to reduce the load on the protocol processor and to speed up the processing of simple messages.
This dissertation studies how the overhead of a software scheme and its contention on the protocol processor can be reduced by various combinations of the design options, and how the software overhead can be further hidden by the communication primitives. In the absence of communication primitives, we employ an analytical model based on mean value analysis (MVA) to estimate the protocol processor's contention and overall performance with fast turnaround. When communication primitives are present, we employ simulation.
We find that the software implementation supplemented with a remote cache and forwarding logic can deliver performance competitive with the rigid, pure hardware scheme. With judicious use of communication primitives, the enhanced software scheme can improve performance beyond the limit of the hardware implementation. In addition, software cache coherence is more flexible, more scalable, and easier to optimize.
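To illustrate the kind of estimate an MVA-based analytical model produces, the following is a minimal sketch of the textbook exact mean value analysis recursion for a single-class closed queueing network. It is not the dissertation's model: the function name `mva`, the choice of service centers, and all numeric demands in the usage note are assumptions made purely for illustration. Here the processors of a cluster are treated as circulating customers, and the protocol processor is one of the queueing centers they contend for.

```python
def mva(demands, think_time, n_customers):
    """Exact single-class MVA for a closed queueing network.

    demands      -- service demand at each queueing center per cycle
                    (hypothetical example: protocol processor, network)
    think_time   -- time per cycle spent without queueing
                    (hypothetical example: local computation)
    n_customers  -- number of circulating customers (e.g. processors)

    Returns (throughput, residence_times, queue_lengths, utilizations).
    """
    queue = [0.0] * len(demands)      # mean queue lengths with 0 customers
    residence = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving customer sees the queue lengths
        # of the same network with one customer fewer.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Little's law applied to the whole cycle.
        throughput = n / (think_time + sum(residence))
        # Little's law applied to each center.
        queue = [throughput * r for r in residence]
    utilization = [throughput * d for d in demands]
    return throughput, residence, queue, utilization
```

As a usage example under assumed numbers, `mva([0.4, 0.2], 2.0, 8)` would model eight processors that each compute for 2.0 time units between remote requests, with each request demanding 0.4 units of protocol-processor time and 0.2 units of network time. The first entry of the returned utilization list then indicates how close the protocol processor is to saturation, which is the contention effect the design options aim to reduce.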