This command, which can only be sent to a Redis Cluster replica node, forces
the replica to start a manual failover of its master instance.

A manual failover is a special kind of failover that is usually executed when
there are no actual failures, but we wish to swap the current master with one of
its replicas (the node we send the command to) in a safe way, without any
window for data loss. It works in the following way:

1. The replica tells the master to stop processing queries from clients.
2. The master replies to the replica with the current _replication offset_.
3. The replica waits for the replication offset to match on its side, to make
   sure it processed all the data from the master before it continues.
4. The replica starts a failover, obtains a new configuration epoch from the
   majority of the masters, and broadcasts the new configuration.
5. The old master receives the configuration update: unblocks its clients and
   starts replying with redirection messages so that they'll continue the chat
   with the new master.

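The ordering of the five steps above can be sketched as a toy, in-memory
simulation. This is only an illustration under stated assumptions: `ToyMaster`,
`ToyReplica`, the `pending` list and the `votes` parameter are hypothetical
names invented for this sketch, not Redis internals, and no networking or
timeouts are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the current master (not a Redis structure)."""
    offset: int = 0                     # current replication offset
    paused: bool = False                # True while client queries are blocked
    redirect_to: Optional[str] = None   # where clients get redirected, if anywhere

    def pause_clients(self) -> int:
        # Steps 1-2: stop processing client queries, report the offset.
        self.paused = True
        return self.offset

    def resume(self) -> None:
        # Toy behaviour for an aborted failover (an assumption of this sketch).
        self.paused = False

    def apply_config(self, new_master: str) -> None:
        # Step 5: unblock clients and start redirecting them.
        self.paused = False
        self.redirect_to = new_master


@dataclass
class ToyReplica:
    """Hypothetical stand-in for the replica receiving the command."""
    name: str
    offset: int = 0
    role: str = "replica"
    pending: List[int] = field(default_factory=list)  # sizes of unprocessed chunks

    def manual_failover(self, master: ToyMaster, votes: int, total_masters: int) -> bool:
        target = master.pause_clients()
        # Step 3: process the remaining replication stream until offsets match.
        while self.offset < target and self.pending:
            self.offset += self.pending.pop(0)
        if self.offset != target:
            master.resume()
            return False
        # Step 4: a majority of masters must authorize the new epoch.
        if votes <= total_masters // 2:
            master.resume()
            return False
        self.role = "master"
        master.apply_config(self.name)
        return True
```

In this sketch the replica is promoted only after its offset equals the offset
the paused master reported, which is the property that closes the window for
data loss.
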
This way clients are moved away from the old master to the new master atomically
and only when the replica that is turning into the new master has processed all
of the replication stream from the old master.

## FORCE option: manual failover when the master is down

The command behavior can be modified by two options: **FORCE** and **TAKEOVER**.

If the **FORCE** option is given, the replica does not perform any handshake
with the master, which may be unreachable, but instead starts a failover as
soon as possible, beginning from step 4. This is useful when we want to start a
manual failover while the master is no longer reachable.

However, even with **FORCE**, we still need the majority of masters to be
available in order to authorize the failover and generate a new configuration
epoch for the replica that is going to become master.

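The majority requirement can be expressed as a one-line check. The sketch
below is illustrative only: `has_failover_quorum` is a hypothetical helper, and
it assumes that "majority" means strictly more than half of all masters,
counting the unreachable master among the total.

```python
def has_failover_quorum(reachable_masters: int, total_masters: int) -> bool:
    """Return True if strictly more than half of the masters are reachable.

    Both arguments are assumptions of this sketch: the caller is expected to
    know how many masters exist and how many can currently be contacted.
    """
    if total_masters <= 0:
        raise ValueError("total_masters must be positive")
    if not 0 <= reachable_masters <= total_masters:
        raise ValueError("reachable_masters must be between 0 and total_masters")
    return reachable_masters > total_masters // 2
```

For example, in a three-master cluster where only the failed master is
unreachable, `has_failover_quorum(2, 3)` is true and a **FORCE** failover can
be authorized; if two of the three masters are unreachable, it is false.
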
## TAKEOVER option: manual failover without cluster consensus

There are situations where this is not enough, and we want a replica to fail
over without any agreement with the rest of the cluster. A real-world use case
for this is to mass promote replicas in a different data center to masters in
order to perform a data center switch, while all the masters are down or
partitioned away.

The **TAKEOVER** option implies everything **FORCE** implies, but also does not
use any cluster authorization in order to fail over. A replica receiving
`CLUSTER FAILOVER TAKEOVER` will instead:

1. Generate a new `configEpoch` unilaterally, by taking the greatest epoch
   currently available and incrementing it, unless its local configuration
   epoch is already the greatest.
2. Assign itself all the hash slots of its master, and propagate the new
   configuration to every reachable node as soon as possible, and eventually
   to every other node.

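The unilateral epoch generation in step 1 can be sketched as follows. This is
a minimal illustration of the rule as described above, not the Redis source:
`takeover_config_epoch` and its parameters are hypothetical names, and the
epochs known to the replica are assumed to be available as a plain list.

```python
from typing import Iterable


def takeover_config_epoch(local_epoch: int, known_epochs: Iterable[int]) -> int:
    """Pick the configuration epoch a replica would use for TAKEOVER.

    local_epoch:  the replica's own configuration epoch (assumed input).
    known_epochs: epochs of the other nodes as this replica currently sees
                  them (assumed input; may be stale inside a partition).
    """
    greatest = max([local_epoch, *known_epochs])
    if local_epoch == greatest:
        # Already the greatest epoch this replica knows about: keep it.
        return local_epoch
    # Otherwise take the greatest known epoch and increment it.
    return greatest + 1
```

Because `known_epochs` reflects only what this replica can see, the result is
not guaranteed to be the greatest epoch in the whole cluster.
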
Note that **TAKEOVER violates the last-failover-wins principle** of Redis
Cluster, since the configuration epoch generated by the replica violates the
normal generation of configuration epochs in several ways:

1. There is no guarantee that it is actually the highest configuration epoch,
   since, for example, we can use the **TAKEOVER** option within a minority,
   and no message exchange is performed to generate the new configuration
   epoch.
2. If we generate a configuration epoch which happens to collide with another
   instance, eventually our configuration epoch, or that of the other instance
   with the same epoch, will be moved away using the _configuration epoch
   collision resolution algorithm_.

Because of this, the **TAKEOVER** option should be used with care.

## Implementation details and notes

`CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not
execute a failover synchronously. It only _schedules_ a manual failover,
bypassing the failure detection stage. To check whether the failover actually
happened, use `CLUSTER NODES` or other means to verify that the state of the
cluster changes some time after the command was sent.

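One way to perform that check is to look at the flags field of the node's line
in the `CLUSTER NODES` output. The sketch below only parses text that has
already been fetched (for example with `redis-cli cluster nodes`);
`node_flags` and `failover_completed` are hypothetical helpers, and the node
IDs, addresses and timestamps in the sample data are made up for illustration.

```python
from typing import Set


def node_flags(cluster_nodes_output: str, node_id: str) -> Set[str]:
    """Return the comma-separated flags (third field) for the given node ID."""
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == node_id:
            return set(fields[2].split(","))
    raise KeyError(node_id)


def failover_completed(cluster_nodes_output: str, replica_id: str) -> bool:
    """True once the former replica is flagged as a master."""
    return "master" in node_flags(cluster_nodes_output, replica_id)


# Illustrative sample data only; real node IDs are 40-character hex strings.
OLD_MASTER_ID = "a" * 40
REPLICA_ID = "b" * 40

BEFORE = (
    f"{OLD_MASTER_ID} 127.0.0.1:30001@31001 master - 0 1426238317239 1 connected 0-5460\n"
    f"{REPLICA_ID} 127.0.0.1:30004@31004 myself,slave {OLD_MASTER_ID} 0 0 1 connected\n"
)
AFTER = (
    f"{OLD_MASTER_ID} 127.0.0.1:30001@31001 slave {REPLICA_ID} 0 1426238317239 2 connected\n"
    f"{REPLICA_ID} 127.0.0.1:30004@31004 myself,master - 0 0 2 connected 0-5460\n"
)
```

A caller would typically fetch the output repeatedly after sending
`CLUSTER FAILOVER` and stop once `failover_completed` returns true or a
deadline chosen by the caller expires.
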
@return

@simple-string-reply: `OK` if the command was accepted and a manual failover is
going to be attempted. An error if the operation cannot be executed, for example
if we are talking with a node which is already a master.