What you have to know about Consul and how to beat the outage problem
06 Oct 2015
For those of you who are open to experimenting, you can start by launching 3 Consul Docker instances and playing around with them.
Give your servers some configuration, e.g.:
{
"leave_on_terminate": false,
"skip_leave_on_interrupt": false,
"bootstrap_expect": 3,
"server": true,
"retry_join": [ ... ],
"rejoin_after_leave": true
}
Start the Consul agents:
$ docker run -P -v /tmp/node1:/data --name node1 -h node1 -i -t --entrypoint=/bin/bash progrium/consul
$ consul agent -config-file=/data/config -data-dir=/data    # run inside the container's shell
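The other two nodes can be started the same way - only the host directory, container name, and hostname change (the join addresses in retry_join stay as placeholders for your setup):
$ docker run -P -v /tmp/node2:/data --name node2 -h node2 -i -t --entrypoint=/bin/bash progrium/consul
$ consul agent -config-file=/data/config -data-dir=/data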
As for the rest, here is what you should know:
How Consul manages cluster membership
Consul does cluster management on 3 levels:
/---------------------\
| L3. Serf | <-> WAN interconnectivity
| | ^
| - peers list | |
| WAN connected nodes | (broadcasting)|
| | |
\---------------------/ |
/-----------------------------------\ v
| L2. Serf | <-> LAN interconn
| | | - node join
| - peers list | | - node leave
| LAN connected servers and clients | | - node reap
| - node1 (srv) | | - ign: update
| - node2 (srv) | | - ign: user
| - node3 (cli) | | - ign: qry
| | |
\-----------------------------------/ |
/-------------------------\ v
| L1. Raft | <-> Quorum consensus
| | Leadership election
| - LAN srvs peers list | (leader do)
| consistent fsm and db | |
| journal / log | (r/w)|
| followers streaming | |
| | |
\-------------------------/ v
/----------------------------------------------\
| KV DB <-> Key-Value persistent storage |
| |
| - consistent writes, con/stale reads allowed |
\----------------------------------------------/
/--------------------\
| Events ss |
| |+---> LAN/WAN broadcast
| - lan broadcasting |
\--------------------/
What to look at
- Consul servers are not the same as Consul clients.
- Serf manages membership questions like node joining / leaving.
- Raft manages leadership questions and consistency.
- Each level of management has its own peers list (nodes joined or left). The key point is that these lists are separate and handled separately.
- What you want is to keep the quorum in a fine state (leader present) - that's why you should be interested in the state of the Raft peers list (which can differ from what the LAN-level Serf currently observes).
- The point of having a separate Raft peers list is to have a list of the nodes which decide who will be the leader of the quorum, because the LAN list, for example, also contains all the clients.
- The Raft peers list is controlled by 1) the leader (actual join / leave management) and 2) LAN Serf layer events (new, failed, left node events).
- The Raft (L1) and Serf (L2) peers lists may get out of sync to the point of containing totally different nodes. In that case, if your Raft peers list contains some old node IPs which are missing from the LAN Serf list, you have to go and get rid of them manually (see the comparison sketch right after this list).
- You can add or remove nodes from the quorum as long as it is in a consistent state (leader present) and there are at least 2 nodes present.
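A quick way to compare the two lists in the Docker setup above (the container name and data path come from the run command at the top and may differ in your layout):
$ docker exec node1 consul members -detailed      # the Serf (LAN) view
$ docker exec node1 cat /data/raft/peers.json     # the Raft view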
The whole point of the Outage page is: a) you are responsible for the quorum health, b) you are responsible for the right values in the Raft peers list, c) if something goes wrong, fix the Raft peers manually on every server node and restart the whole cluster.
For the quorum to be healthy, every node must know the neighbours that are kept in the Raft peers list at raft/peers.json. You should have at least 2 known nodes there (i.e. self + at least one other; self, by the way, may be omitted) in order for the quorum to be able to elect a leader on startup.
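For instance, a minimal raft/peers.json for a 2-node quorum would look something like this (the addresses are illustrative, matching the 3-server example further down):
["172.17.0.30:8300","172.17.0.31:8300"]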
Available node states are:
- bootstrap is a bootstrapping mode in which you allow a single Consul node to elect itself as the leader of its own single-node cluster. You are not allowed to have other nodes joined / found in a cluster started in this mode. Two servers both started in this mode will find each other in the Serf peers list, but not in Raft, so they will not build up a quorum - each will keep holding leadership of itself.
- bootstrap_expect is another bootstrap mode which allows a multi-node quorum. It introduces an expectation number: the number of server nodes to wait for before starting the quorum. No actions are taken in this mode until all the expected nodes have come up and become known.
- Initialized literally means that bootstrap already happened at some point. The raft/ and serf/ paths should be present, representing the current node state. In this mode, in the case of a multi-node quorum, the quorum can be rebuilt from the ground up with at least 2 nodes in the Raft peers list, regardless of the original expect number (even if it was greater than what you ended up with).
Available operational modes are:
- Single-node cluster must be explicitly enabled with a special flag (bootstrap), which forbids any other configuration (more than 1 known server) for this specific node. It can not be part of any other cluster (a config sketch follows this list).
- Multi-node cluster is enabled by omitting the single-node mode flag. Thanks to Raft, a leader can be elected as long as at least 2 nodes are present. That is fine for Consul - it can go with 2, though it is not recommended. Thus, a multi-node cluster can't go below a 2-node quorum. In that state it will lose leadership, because single-node clusters are not allowed in this mode. Put another way, if a node ends up alone in a single-node state, you have a split-brain situation in which it can't operate further, as it will assume the majority of the nodes is out there somewhere.
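For comparison with the bootstrap_expect config at the top, a single node cluster would be started with a config along these lines (a sketch; only the mode-related keys are shown):
{
"server": true,
"bootstrap": true
}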
What are your bootstrap and expect server values
In order to restore your cluster (the Raft quorum) you need to know / be sure that you have a consistent set of those two values (bootstrap mode, expect count) across your set of Consul servers.
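For the Docker setup above, one way to eyeball this is to dump the config of every server (the host paths come from the -v mounts in the run commands and may differ in your setup):
$ for n in node1 node2 node3; do echo "== $n"; cat /tmp/$n/config; done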
Where your raft/peers.json and serf/local.snapshot are located and what they contain
In an outage situation you will want to find out what state your cluster is in. This can be done by looking at your Serf and Raft peers lists.
- raft/peers.json contains the currently known Raft peers set - literally the nodes that take part in the quorum and consensus (see also the status endpoint example after this list).
- serf/*.snapshot contains a journal of the Serf protocol progressing through time. You may be interested in it to find out what events took place over time.
- ./consul members -detailed or /v1/catalog/nodes will show you the current Serf (LAN / WAN) peers list.
$ docker exec node1 consul members -detailed
- /v1/status/leader will give you the current leader status in terms of the elected node address.
$ curl -v http://127.0.0.1:8500/v1/status/leader
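The Raft peer set as the agent currently sees it can also be read over HTTP via the status endpoint (assuming the default HTTP port, as in the leader example above):
$ curl -v http://127.0.0.1:8500/v1/status/peers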
Configuration of leave event on termination
There are 2 parameters that affect the behaviour of cluster members on shutdown and restart. You should consider them if you are willing to keep the quorum in a fixed state (i.e. mostly without leaving nodes) or if you restart machines often and in parallel.
You should decide which mode of operation is preferred: either you allow Consul to automatically leave the quorum on shutdown, or you count on yourself to manually manage departed peers with leave or force-leave.
If you allow Consul to publish a leave event on node shutdown and then push the whole cluster to restart, you will end up with a fully broken cluster and no quorum at all.
- leave_on_terminate false by default
- skip_leave_on_interrupt false by default (consider this)
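If the goal is to keep the Raft peers list stable across parallel restarts, a config along these lines prevents graceful leaves on both signals (a sketch; merge it with the rest of your server config):
{
"leave_on_terminate": false,
"skip_leave_on_interrupt": true
}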
Default response to the signals
- SIGHUP - Reload, always. Only some things can be reloaded,
- SIGINT - Graceful leave by default, can be configured to be non-graceful,
- SIGTERM - Non-graceful leave. Can be configured to be graceful.
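To see the difference, you can send the signals to the agent process on one of the nodes yourself (a sketch, assuming a single consul process and pidof being available there):
$ kill -INT $(pidof consul)     # graceful leave, unless skip_leave_on_interrupt is set
$ kill -TERM $(pidof consul)    # non-graceful stop, unless leave_on_terminate is set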
What to do in an outage situation with a Multi-Server Cluster
This is the kind of situation in which you end up with No cluster leader and it can't heal itself. That means you lost your quorum and the quorum lost its majority. In multi-node cluster mode it actually means that your Raft peer links are completely wrong (i.e. the case when each node knows only about itself).
Read the Outage doc.
In order to make your cluster work again, you have to put it into a consistent state in which the leader and the quorum are present:
- Get rid of missing / dead peers in the Raft peers list,
- Make the rest of the good nodes know about each other, so that they are contained in each other's Raft peers lists,
- Rebuild the quorum and elect the leader.
To achieve this, follow these steps:
1. Detect whether you have a quorum and a leader present:
$ curl -v http://127.0.0.1:8500/v1/status/leader or $ curl -v http://127.0.0.1:8500/v1/kv/?keys=&separator=/
2. If your Consul nodes are up and joined together, you can verify that they see each other by checking the Serf peers list:
$ consul members -detailed
3. Check that the Raft peers list contains the right set of nodes on every Consul server that should form the quorum. By right I mean:
3.1. no dead, left, or failed servers are present,
3.2. all server nodes are in there and seen by Serf (check step 2) as well.
In order to do it, go to raft/peers.json, open it, read it, add and/or remove nodes, and save. Repeat on every node. If you ran into this situation of unsynced Raft peers across the cluster, you have to stop the nodes before manually fixing the Raft peers list. After you are sure the raft/peers.json files are good, start everything up. Consul will succeed in building the quorum and electing a new leader (a scripted sketch of this procedure follows these steps). Example of what your Raft peers file should look like for 3 Consul servers:
["172.17.0.30:8300","172.17.0.31:8300","172.17.0.32:8300"]
If you have only a single entry there, [] or null, it most likely means it is wrong.
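A scripted sketch of that procedure for the 3-node Docker setup from the top (the host paths come from the -v mounts, the IPs are the illustrative ones above, and how you bring the agents back up depends on how you launched them):
$ docker stop node1 node2 node3
$ for n in node1 node2 node3; do echo '["172.17.0.30:8300","172.17.0.31:8300","172.17.0.32:8300"]' > /tmp/$n/raft/peers.json; done
$ docker start node1 node2 node3    # then re-run the consul agent command in each container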
About the outage doc
It is missing some points in the part about manually fixing raft/peers.json.
- It is not enough to just get rid of the failed nodes. You have to make sure you have all the other healthy nodes in there. Without that you will not build up a quorum, and without a working quorum it is impossible to do what they claim next:
"If any servers managed to perform a graceful leave, you may need to have them rejoin the cluster using the join command"
Before doing that, you should fix your quorum (or add / restore those servers manually, straight into the json file).
- In case you have to fix your peers json files manually, it makes sense to add everything you need at once.
Case we ran into
- We restarted the whole cluster in parallel.
- skip_leave_on_interrupt was set to false, so every node issued a leave event to the cluster, and each of them ended up with a cleared Raft peers list.
- After the restart they of course failed to build a quorum without any neighbours in the Raft list.
To fix it, we restored the original raft/peers.json file on each server node and restarted the cluster.
Observable Consul behaviour on termination
All Consul servers have the consul service embedded. All Consul nodes have the Serf Health Status health check, which is not exposed by default for client node types.
- if a Server node Left - it leaves the /v1/catalog/nodes list, /v1/health/service/consul and /v1/health/node/{name},
- if a Server node Failed - it does not leave the /v1/catalog/nodes list, /v1/health/service/consul and /v1/health/node/{name}, having: [{"Node":"node3","CheckID":"serfHealth","Name":"Serf Health Status","Status":"critical","Notes":"","Output":"Agent not live or unreachable","ServiceID":"","ServiceName":""}],
- if a Client node Left - it leaves the /v1/catalog/nodes list, /v1/health/service/consul and /v1/health/node/{name},
- if a Client node Failed - it does not leave the /v1/catalog/nodes list, /v1/health/service/consul and /v1/health/node/{name}, having: [{"Node":"node4","CheckID":"serfHealth","Name":"Serf Health Status","Status":"critical","Notes":"","Output":"Agent not live or unreachable","ServiceID":"","ServiceName":""}].
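You can check this yourself after stopping one of the nodes (the node name and port come from the examples above):
$ curl -s http://127.0.0.1:8500/v1/catalog/nodes
$ curl -s http://127.0.0.1:8500/v1/health/node/node3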
Tips
- When a cluster leader is present, you can join / leave nodes live - it will auto-update the peers json,
- You can add new servers straight to the Raft peers list (you can add new servers with new IPs right into the json),
- If your Raft peers set and Serf are out of sync and there are some nodes in Raft that are missing from Serf, the only option to remove them is to manually cut them out of raft/peers.json. force-leave only works for nodes that are first present in the Serf peers list (at least as of version 0.5.2).
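For reference, removing a failed node that is still visible in the Serf peers list looks like this (the node name is illustrative):
$ consul force-leave node3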