Fault-Tolerance by Replication in Distributed Systems, by André Schiper. Email: schiper@lsesun3.epfl.ch, URL: ftp://ftp-lse.epfl.ch/pub/private/people/schiper.html |
Abstract: | Dependability, i.e. reliability and availability, is one of the
biggest trends in software technologies. In the past, it has been
considered acceptable for services to be unavailable because of
failures. This is rapidly changing: the requirement for high
software reliability and availability is continually increasing in
domains such as finance, booking-reservation, industrial control,
telecommunication, etc. One obvious possibility is to build
reliable software on top of fault-tolerant (replicated) hardware.
This may indeed be a viable solution for some application classes,
and has been successfully pursued by companies such as Tandem
and Stratus. Economic factors have, however, motivated the search
for cheaper software-based fault-tolerance, i.e. fault-tolerance
by software-based replication. While this principle is readily
understood, the techniques required to implement replication are
surprisingly difficult to master. The talk will concentrate on the
techniques that have been developed since the mid-eighties to
implement replicated services. These techniques take process
groups and group communication as their basic abstractions. A
replicated service is encapsulated within a group and thus appears
as a single entity to the outside world. Interactions among the
processes within the group, as well as those between the service
and its clients, are implemented through group communication
primitives, including reliable multicast with different message
ordering guarantees. The Isis system, developed at Cornell
University, pioneered this approach. The talk will describe the
various replication techniques (passive replication, active
replication, coordinator-cohort replication) and the group
multicast primitives needed to implement them. The implementation
of group multicast primitives will be briefly discussed, stressing
the difficulty of guaranteeing both safety and liveness properties.
Finally, the relationship between the group communication paradigm
and classical quorum-based techniques (e.g. static voting, dynamic
voting) for handling replication will be mentioned. |
---|---|
Biography: | André SCHIPER has been a professor of Computer Science at the EPFL
since 1985, leading the Operating Systems laboratory. His current
research interests are in the area of fault-tolerant distributed
systems. He has worked on the new implementation of communication
primitives for the Isis system, developed at Cornell University.
André Schiper has taken part in the ESPRIT Basic Research Project
BROADCAST, and is a member of the CABERNET Network of Excellence
in Distributed Systems. He is also a partner in the new ESPRIT
R&D project OpenDREAMS, whose objective is to integrate various
technologies, including fault-tolerance by replication, within a
CORBA infrastructure. |