Possible System Failure Modes in a Distributed Database / Types of System Failures in a Distributed Database / How does the 2PC protocol handle failures in a distributed database?
Well-known failures such as software errors, hardware failures, hard disk
failures, and power failures are common in both centralized and distributed
database systems. Apart from these common failures, a distributed database
system may also suffer from the failures listed below:
Failure of a site
- A distributed database consists of two or more servers, which are also called
sites. Any of these sites might fail. Although the cause is an ordinary hardware
or software failure, in a distributed system it must be handled differently,
because the remaining sites continue to operate.
Loss of messages
- Messages exchanged between sites might be lost. Transmission protocols such as
TCP/IP are responsible for handling these losses, typically by retransmitting
the lost data.
Failure of a communication link
- A communication link between two sites might fail. In such a case, the
distributed database system may try to find an alternate route for sending its
messages.
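The idea of finding an alternate route can be illustrated with a small sketch. It assumes the sites and links form an undirected graph and uses a breadth-first search to find a path that avoids failed links. The function name `find_route`, the site names `S1` to `S3`, and the link list are hypothetical and chosen only for illustration; real systems usually leave routing to the network layer.

```python
from collections import deque

def find_route(links, source, target, failed=frozenset()):
    """Breadth-first search for a path from source to target that avoids failed links."""
    # Build an undirected adjacency map, skipping any link marked as failed.
    neighbours = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        site = path[-1]
        if site == target:
            return path
        for nxt in sorted(neighbours.get(site, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no route left: the two sites can no longer reach each other

# Hypothetical three-site network in which every pair of sites is linked.
links = [("S1", "S2"), ("S2", "S3"), ("S1", "S3")]
direct = find_route(links, "S1", "S3")
detour = find_route(links, "S1", "S3", failed={("S1", "S3")})
```

With all links working, the search returns the direct path. Once the S1-S3 link is marked as failed, it falls back to the route through S2. When no route remains, the function returns `None`.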
Network partition
- A distributed database system is said to be partitioned when it is split into
two or more subsystems that cannot communicate with each other. A subsystem is a
set of one or more sites that is connected to the other subsystems through a
single connection. For example, consider a distributed database system that
manages sites at three different college campuses. Each campus may have several
sites internally, but it is connected to the other campuses through a single
connection. If this connection fails, the distributed database system cannot
diagnose the actual problem: seen from inside one campus, the failure could be
a failure of a site, a loss of messages, or a communication link failure.
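The campus example can be sketched as a small graph computation. The code below groups sites into subsystems that can still reach each other, given the set of working links. The site names (`A1`, `B1`, and so on), the link list, and the function name `partitions` are assumptions made for illustration. Note that the sketch takes a global view of the network that no individual site actually has; a real site only observes that messages to the other campuses go unanswered.

```python
def partitions(sites, working_links):
    """Group sites into subsystems whose members can still reach each other."""
    neighbours = {s: set() for s in sites}
    for a, b in working_links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in sorted(sites):
        if start in seen:
            continue
        # Depth-first walk collecting every site reachable from `start`.
        group, stack = set(), [start]
        while stack:
            site = stack.pop()
            if site in group:
                continue
            group.add(site)
            stack.extend(neighbours[site] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical layout: three campuses (A, B, C) with two sites each.
# A1-B1 and B1-C1 are the single connections between campuses.
sites = ["A1", "A2", "B1", "B2", "C1", "C2"]
links = [("A1", "A2"), ("B1", "B2"), ("C1", "C2"), ("A1", "B1"), ("B1", "C1")]

healthy = partitions(sites, links)
after_failure = partitions(sites, [l for l in links if l != ("A1", "B1")])
```

While every link works, all six sites form one group. After the A1-B1 connection fails, campus A becomes a subsystem of its own and campuses B and C form the other.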
Among all the failures discussed above, failure of a site and network partition
need extra care when handling failures in a distributed database.
Please recall from the post Distributed Transactions the various components of
the Transaction System Structure. As shown in the figure below, every site has
its own Transaction Coordinator and Transaction Manager. In distributed
database systems, the resources are shared among many sites. Hence, the site
which initiates a transaction T is treated as the coordinating site (its
Transaction Coordinator [TC] is responsible). The other sites that participate
in completing the transaction T are called participating sites (the
Transaction Managers [TM] of those sites are responsible). So, the failure of a
site can be understood in two different senses: failure of a participating site
and failure of the coordinator.
Figure 1 - Transaction System Structure in a Distributed Database
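The distinction between the two kinds of site failure depends on the role a site plays for a given transaction. The following sketch makes that explicit. It assumes the initiating site is the coordinator and every other site touched by the transaction is a participant, as described above. The function name `classify_site_failure` and the site names are hypothetical.

```python
def classify_site_failure(initiating_site, sites_touched, failed_site):
    """Return which kind of site failure `failed_site` is for one transaction T."""
    if failed_site == initiating_site:
        # The site that started T runs the Transaction Coordinator for T.
        return "coordinator failure"
    if failed_site in sites_touched:
        # Any other site that does work for T acts through its Transaction Manager.
        return "participant failure"
    return "not involved"

# Hypothetical transaction T: started at S1 and touching data at S1, S2 and S3.
roles = [classify_site_failure("S1", {"S1", "S2", "S3"}, s)
         for s in ("S1", "S2", "S4")]
```

The same site can be the coordinator for one transaction and a participant for another, so the classification is made per transaction rather than per site.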
The links given below will take you to the posts about handling failures in a
distributed database.