Performance of Service-Discovery Architectures in Response to Node Failures

Christopher Dabrowski, Kevin Mills, and Andrew Rukhin

Designs for distributed systems must account for the possibility that failures will arise, and must adopt specific failure-detection and recovery strategies. Many distributed-object systems employ simple techniques to detect and report failures, leaving applications to decide upon appropriate recovery strategies. In this paper, we investigate the ability of selected designs for service-discovery protocols to detect and recover from the failure of remote services when used to support real-time distributed control applications. We model two architectures (two-party and three-party) underlying most commercial service-discovery systems. We use simulation to quantify the functional effectiveness and efficiency achieved by the two architectures as the failure rate of remote services increases. We further decompose non-functional periods into failure-detection latency and restoration latency. Our quantitative measurements suggest that a two-party architecture yields better robustness than a three-party architecture. We discuss the underlying causes for this outcome.
