Design and Analysis of an Integrated Checkpointing and Recovery Scheme for Distributed Applications
Author
Ramamurty, Bina; Upadhyaya, Shambhu; Bhargava, Bharat
Abstract
An integrated checkpointing and recovery scheme that exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and to assist in recovery. The analytical model, algorithms, and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.
Journal
IEEE Transactions on Knowledge and Data Engineering
Publisher
IEEE Computer Society
Publication Date
1999-08-08
Location
A hard-copy of this is in the Papers Cabinet
Subject
Design and Analysis of an Integrated Checkpointing and Recovery Scheme for Distributed Applications