Designs for distributed systems must consider the possibility that failures will arise, and must adopt specific failure detection and recovery strategies. Many distributed-object systems employ simple techniques to detect and report failures, requiring applications to decide upon appropriate recovery strategies. In this paper, we investigate the ability of selected designs for service-discovery protocols to detect and recover from failure of remote services when used to support real-time distributed control applications. We model two architectures (two-party and three-party) underlying most commercial service-discovery systems. We use simulation to quantify functional effectiveness and efficiency achieved by the two architectures as the rate of failure increases for remote services. We further decompose non-functional periods into failure-detection latency and restoration latency. Our quantitative measurements suggest that a two-party architecture yields better robustness than a three-party architecture. We discuss the underlying causes for this outcome.