Abstract
In an asynchronous distributed computation, processes may fail and restart
from saved state. A protocol for "optimistic rollback recovery" must
recover the sytem when other processes may depend on lost states at
failed processes. Previous work has used forms of partial order clocks to
track potential causality. Our research addresses two crucial short-
comings: the rollback problem also involves tracking a second level of
partial order time (potential knowledge of failures and rollbacks), and
protocols based on partial order clocks are open to inherent security
and privacy risks. We have developed a "distributed time" framework that
provides the tools for multiple levels of time abstraction, and for
identifying and solving the corresponding security and privacy risks.
This paper applies our framework to the rollback problem. We derive a
new optimistic rollback recovery protocol that provides "completely
asynchronous" recovery (thus directly supporting concurrent recovery
and tolerating network partitions) and that enables processes to take
full advantage of their maximum potential knowledge of orphans (thus
reducing the worst case bound on asynchronous recovery after a single
failure from exponetial to at most one rollback per process). By
explicitly tracking and utilizing both levels of partial order time,
our protocol substantially improves on previous work in optimistic
recovery. Our work also provides a foundation for incorporating
security and privacy in optimistic rollback recovery.