Efficient replication management in distributed systems
Rabinovich, Michael, 1955-
Replication is a critical aspect of large-scale distributed systems. Without it, the size of a system is limited by factors such as the increased risk of component failures, the overloading of popular services, and access latency to remote parts of the system.

Replication alleviates these problems by allowing service to continue despite failures using the remaining replicas, and by distributing requests for a given service among multiple servers. However, replicated systems incur significant performance overhead for maintaining multiple replicas and keeping them mutually consistent. Managing replication efficiently is therefore important in building large-scale distributed systems.

This dissertation concentrates on quorum-based replication management. It proposes several ways to efficiently manage replication in systems using different types of quorums.

First, we study structure-based quorums, which are attractive because they result in low-overhead replica management when the number of replicas is high. We propose a way to significantly improve system availability in protocols using these quorums. We also study in depth the performance of a particular class of these quorums based on a grid structure.

Second, for voting-based quorums, we propose a way to include new or repaired replicas into the system and to exclude failed, disconnected, or deactivated replicas, asynchronously with user operations. Thus, such reconfiguration does not involve any system interruption and can be done more freely for various purposes, e.g., improving data availability, migrating data, or redistributing load.

Third, we propose a way to efficiently manage read-one-write-all (ROWA) replication. Protocols based on ROWA quorums maintain consistency by reading one copy of the data and writing all copies. They are especially attractive because they provide highly efficient and fault-tolerant read operations, and because many services and data in distributed systems fall under the mostly-read category. The obvious weakness of ROWA, which has prevented it from being widely used in commercial wide-area networks, is lack of fault-tolerance for writes. This thesis proposes a new ROWA protocol that provides fault-tolerant write operations without any significant deterioration of read properties. This protocol offers immediate benefits to current systems.

Finally, the thesis studies some issues that are common to all types of quorums, including semantics of write operations and efficient protocols for transaction commitment.
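As an illustration of the quorum ideas summarized above, the following minimal Python sketch models a textbook voting-style store and treats ROWA as the special case of read quorum 1 and write quorum n. It is not the protocol proposed in the dissertation. The names `Replica`, `QuorumStore`, and `rowa` are hypothetical, failures are simulated with a simple `up` flag, and the intersection conditions `r + w > n` and `2w > n` are the classic ones for voting schemes.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of a data item; `up` simulates whether it is reachable."""
    value: object = None
    version: int = 0
    up: bool = True


@dataclass
class QuorumStore:
    """Hypothetical store with read quorum size r and write quorum size w."""
    replicas: list
    r: int
    w: int

    def __post_init__(self):
        n = len(self.replicas)
        # Classic conditions: every read quorum meets every write quorum,
        # and any two write quorums meet.
        if not (self.r + self.w > n and 2 * self.w > n):
            raise ValueError("quorums do not intersect")

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def read(self):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # The highest version in any read quorum is the latest write.
        return max(live[: self.r], key=lambda rep: rep.version).value

    def write(self, value):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        quorum = live[: self.w]
        version = max(rep.version for rep in quorum) + 1
        for rep in quorum:
            rep.value, rep.version = value, version


def rowa(n):
    """Read-one-write-all: r = 1, w = n."""
    return QuorumStore([Replica() for _ in range(n)], r=1, w=n)
```

In this sketch, a ROWA store keeps serving reads as long as one replica is up, but a single unreachable replica blocks every write, which is the weakness the abstract refers to. A majority configuration such as `r=2, w=2` over three replicas tolerates one failure for both operations at the cost of more expensive reads.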