Recovering Quorum After Renaming Servers
John Weldon — January 20, 2024
Occasionally, a NATS cluster can lose quorum for various reasons. Here, we’ll look at one specific case, and how to recover from it.
Note on nats-server 2.12+: The scale-up recovery procedure in this article was the sanctioned path on nats-server 2.10 and 2.11. As of 2.12.0 (PR #7038), it no longer works: a fresh empty node can no longer force itself into the peer set, so the meta election never passes. If you are running 2.12 or later, see Recovery on 2.12 and Later below.
Context
How To Rename Servers in a Cluster
The recommended way to rename NATS servers in a cluster is to rename one at a time. After each rename, the cluster will have a record of both the old name and the new name. The former will appear offline, and the latter should appear online. You should remove the record of the old name before renaming the next server; otherwise the cluster may eventually accumulate too many faux-offline servers and consider itself to have lost quorum.
NATS Helm Chart Caveats - A Brief Diversion
The values.yaml file for the NATS Helm Chart has an option to set a serverNamePrefix, which you might be tempted to use to
rename the servers in a helm chart deployed cluster.
This setting should only be changed before the first installation.
Once the cluster is up and running, if you change this value, and then upgrade the
helm release, you’ll cause all of the servers in the cluster to be simultaneously renamed.
This will double the number of recorded servers in the cluster (half with the old name, and
half with the new name, per the changed serverNamePrefix).
Consequently, there will not be enough servers active in the cluster to retain quorum.
On nats-server 2.12 and later, reverting the serverNamePrefix change does not unbrick the cluster on its own.
The original-named pods have to actually come back online – empty disks are fine, but the names must match – for the meta election to pass.
See Recovery on 2.12 and Later below.
A Case Study
For the sake of this article, we’ll assume a helm chart [1] deployed cluster of three servers. [2]
We start by “breaking” this cluster: we simply modify the serverNamePrefix to rename the
servers, and then upgrade the release.
First Indication of Trouble
Logs
The first indication of trouble is a warning (WRN) and an informational (INF) message like these in the logs:
[WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[INF] JetStream cluster no metadata leader
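To spot these messages quickly in a longer log, a small filter can help. This is only a sketch: the function name count_meta_leader_warnings is hypothetical, it matches nothing but the two messages quoted above, and the pod and container names in the usage comment are assumptions based on the release name used in this article.

```shell
# Hypothetical helper: count log lines on stdin that match either of the
# two "no meta leader" messages quoted in this article.
count_meta_leader_warnings() {
  # grep -c prints the number of matching lines; "|| true" keeps the exit
  # status at zero when nothing matches (grep itself would return 1).
  grep -c -e 'JetStream has not established contact with a meta leader' \
          -e 'JetStream cluster no metadata leader' || true
}

# Usage (pod and container names are assumptions):
#   kubectl logs nats-helm-kind-0 -c nats | count_meta_leader_warnings
```

A count that keeps growing across repeated runs suggests the servers are still unable to find a meta leader.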
Events
Another indication is that the NATS pods fail to progress to a ready state; the NATS container specifically shows that it’s running, but the readiness is false.
The events will show a warning with the message
Readiness probe failed: HTTP probe failed with statuscode: 503
NATS CLI
Using the NATS CLI, running nats server report jetstream will also reveal the problem;
depending on the cluster state, you will see one of the following:
Before Quorum is Lost
At first, you’ll just see half the servers (with the old name) as being offline:
$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-1 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │ │ false │ false │ 0s │ 0 │
│ nats-helm-kind-0 │ YMpQSy04 │ │ false │ false │ 19.53s │ 1 │
│ nats-helm-kind-1 │ MGRogjE4 │ │ false │ false │ 0s │ 13 │
│ x-nats-helm-kind-0 │ svvjmHnE │ │ true │ true │ 526ms │ 0 │
│ x-nats-helm-kind-1 │ XCzEfWSa │ │ true │ true │ 525ms │ 0 │
│ x-nats-helm-kind-2 │ XGX0cX6V │ yes │ true │ true │ 0s │ 0 │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
After Quorum is Lost
After the quorum is lost, but before the readiness probes cause the servers to stop responding, for a brief window of time you’ll see the error in the jetstream report:
$ nats server report jetstream
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-1 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-2 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
╰────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
WARNING: No cluster meta leader found. The cluster expects 6 nodes but only 3 responded. JetStream operation require at least 4 up nodes.
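The three numbers in that warning are the ones that matter during recovery. As a rough sketch, they can be pulled out of the report with sed; the function name parse_quorum_warning is hypothetical, and the pattern assumes the warning is worded exactly as shown above.

```shell
# Hypothetical helper: read `nats server report jetstream` output on stdin
# and print the node counts from the "No cluster meta leader" warning.
# Assumes the warning wording shown in this article; prints nothing if the
# warning is absent.
parse_quorum_warning() {
  sed -n 's/.*The cluster expects \([0-9][0-9]*\) nodes but only \([0-9][0-9]*\) responded.*at least \([0-9][0-9]*\) up nodes.*/expected=\1 responded=\2 required=\3/p'
}

# Usage:
#   nats server report jetstream | parse_quorum_warning
```

For the warning above this prints expected=6 responded=3 required=4, which makes the gap of one node explicit.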
After Servers Stop Responding
Finally, the servers may stop responding altogether, giving you the general error:
$ nats server report jetstream
nats: error: nats: no servers available for connection
command terminated with exit code 1
Recovery on 2.12 and Later
On 2.12 and later, the scale-up recovery path no longer works. A fresh empty node can no longer force itself into the peer set on first contact (see PR #7038), so adding a new node to a cluster that has lost quorum will not trigger a meta election.
The safe recovery path is to bring the renamed (or downed) nodes back under their original names. The cluster recognizes peers by name; restoring the original names is what re-establishes quorum.
Revert the Rename
For the canonical case in this article (a bulk rename caused by changing serverNamePrefix),
revert the change in values.yaml and upgrade the helm release:
$ helm upgrade <release-name> nats/nats -f values.yaml
A helm upgrade updates the StatefulSet in place rather than recreating it, so the existing
PVCs stay attached while the pods restart, and the originally-named servers come back with their original JetStream data intact.
Once the originally-named nodes are online, the meta election succeeds and quorum is regained.
Verify with:
$ nats server report jetstream
Once a leader is elected, clean up any stale peer entries left over from the failure (see Clean Up Stale Peers below).
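Because the election can take a little while after the pods return, it may be convenient to poll the report instead of re-running it by hand. The sketch below is one way to do that under stated assumptions: wait_for_meta_leader is a hypothetical name, and it treats "the command succeeded and did not print the No cluster meta leader found warning" as a sign that a leader exists, which is a heuristic based on the outputs shown earlier in this article.

```shell
# Hypothetical helper: wait_for_meta_leader ATTEMPTS DELAY_SECONDS CMD...
# Runs CMD repeatedly until it succeeds without printing the CLI's
# "No cluster meta leader found" warning, or until ATTEMPTS is exhausted.
wait_for_meta_leader() {
  local attempts="$1" delay="$2" i=1 out
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # A failing command (for example "no servers available") or the
    # warning text both count as "no leader yet".
    if out=$("$@" 2>&1) && ! printf '%s\n' "$out" | grep -q 'No cluster meta leader found'; then
      echo "meta leader present after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no meta leader after $attempts attempt(s)" >&2
  return 1
}

# Usage: up to 30 attempts, 10 seconds apart.
#   wait_for_meta_leader 30 10 nats server report jetstream
```

Passing the command as arguments keeps the helper independent of how you reach the CLI, so the same function works with a nats-box wrapper.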
When the Original Disks Are Wiped
If the PVCs were deleted or the storage was destroyed but the hostnames can be reused, bring the originally-named nodes back with empty disks. On 2.12+, a peer with an empty disk uses an “empty vote” (introduced by PR #7038) that only counts when all peers in the original set vote, so every originally-named node has to be available at the same time for the leader election to pass. If even one original peer cannot come back, this path is also blocked.
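Since every original peer has to be present at once, it can be worth checking the list of names before expecting an election. The following is a minimal sketch of such a pre-flight check: missing_original_peers is a hypothetical helper, and both name lists are supplied by you (the names in the usage comment come from this article's example cluster).

```shell
# Hypothetical helper: missing_original_peers "ORIGINAL NAMES" "ONLINE NAMES"
# Both arguments are space-separated server names. Prints the original
# names that are not online; prints an empty line if none are missing.
missing_original_peers() {
  local name missing=""
  for name in $1; do
    # Pad with spaces so only whole names match: "x-nats-helm-kind-0"
    # must not satisfy "nats-helm-kind-0".
    case " $2 " in
      *" $name "*) ;;
      *) missing="$missing $name" ;;
    esac
  done
  echo "${missing# }"
}

# Usage (names from this article's example cluster):
#   missing_original_peers "nats-helm-kind-0 nats-helm-kind-1 nats-helm-kind-2" \
#                          "nats-helm-kind-0 nats-helm-kind-2"
```

An empty result means every name in the original set is accounted for; any name printed is a peer whose absence would block this recovery path.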
Clean Up Stale Peers
After quorum is restored, the meta group may still list peer entries left over from the failure
(for example, the new-prefix server names that were added before the rename was reverted).
Remove them with peer-remove, using the peer ID from nats server report jetstream:
$ nats server cluster peer-remove -f <peer ID>
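When several stale entries are left, copying peer IDs by hand gets tedious. As a sketch only, the IDs of offline peers can be extracted from the RAFT Meta Group table; offline_peer_ids is a hypothetical helper that assumes the seven-column table layout shown in this article (Name, ID, Leader, Current, Online, Active, Lag), so treat its output as a candidate list to review rather than something to remove blindly.

```shell
# Hypothetical helper: read `nats server report jetstream` output on stdin
# and print the ID of every RAFT meta group peer whose Online column is
# "false". Assumes the seven-column table layout shown in this article.
offline_peer_ids() {
  # Turn the box-drawing column separator into a plain "|" so awk can
  # split on a single-byte character.
  sed 's/│/|/g' | awk -F'|' '
    # Rows of the RAFT table have 7 cells between the outer borders,
    # which yields 9 fields after splitting on "|".
    NF == 9 {
      id = $3; online = $6
      gsub(/[ \t]/, "", id); gsub(/[ \t]/, "", online)
      if (online == "false") print id
    }'
}

# Usage: list candidates first, then remove them one by one.
#   nats server report jetstream | offline_peer_ids
#   nats server cluster peer-remove -f <peer ID>
```

An offline peer is not necessarily a stale one: a server that is merely restarting also shows as offline, so compare the names against the entries you expect to remove.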
When Original Names Cannot Be Reclaimed
If the original hostnames cannot be brought back at all (machines truly destroyed, names unrecoverable), there is no in-product recovery path on 2.12 and later at the time of writing. Restore from backup.
Recovery on 2.10 and 2.11
The procedure below relies on the pre-2.12 behavior where a fresh empty node could force itself into the peer set on first contact; that path is closed on 2.12 and later (see the note at the top and Recovery on 2.12 and Later).
Regain Quorum
To recover, the cluster must first regain quorum. In this case, the cluster thinks that there are six nodes [3] in the cluster, so to regain quorum there needs to be a minimum of four nodes [3] reachable from each other.
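The arithmetic behind those numbers is the usual majority rule: a group of n recorded peers needs floor(n/2) + 1 of them to be reachable. A small sketch, with hypothetical function names, makes it easy to check other cluster sizes:

```shell
# Hypothetical helpers illustrating the majority rule.

# quorum_needed PEERS: smallest majority of PEERS recorded peers.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum PEERS ONLINE: succeeds if ONLINE peers form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_needed "$1")" ]
}

# Usage:
#   quorum_needed 6      # six recorded peers need four online
#   has_quorum 6 3 || echo "quorum lost"
```

Note that a newly added server is itself recorded as a peer: with seven recorded peers the majority is still four, which is why a single extra server is enough in this case.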
The way to do this is to add one more server, which will allow quorum to be regained. You can do this by scaling the stateful set:
$ kubectl scale --replicas=4 statefulset/nats-helm-kind
Once the scale-up completes, quorum should be regained:
$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-1 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ x-nats-helm-kind-3 │ nats-helm-kind │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │ │ false │ false │ 0s │ 0 │
│ nats-helm-kind-0 │ YMpQSy04 │ │ false │ false │ 47m27s │ 6 │
│ nats-helm-kind-1 │ MGRogjE4 │ │ false │ false │ 0s │ 18 │
│ x-nats-helm-kind-0 │ svvjmHnE │ │ true │ true │ 461ms │ 0 │
│ x-nats-helm-kind-1 │ XCzEfWSa │ │ true │ true │ 461ms │ 0 │
│ x-nats-helm-kind-2 │ XGX0cX6V │ yes │ true │ true │ 0s │ 0 │
│ x-nats-helm-kind-3 │ G7oD67bf │ │ true │ true │ 461ms │ 0 │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
Remove Offline/Old Servers
Now you can begin cleaning up the old server records. You can do this either with the CLI or by using NATS directly.
CLI
Using the CLI [4]:
$ nats server cluster peer-remove -f <peer ID>
Using NATS Directly
You can also remove a peer directly by sending a request to the JetStream API subject:
$ nats request '$JS.API.SERVER.REMOVE' '{"peer":"","peer_id":"YMpQSy04"}'
The server replies with a message that confirms the action:
{
"type": "io.nats.jetstream.api.v1.meta_server_remove_response",
"success": true
}
Remove Temporarily Added Server
Scale Back Down to Three Servers
Now that the cluster knows of four servers instead of six, it’s safe to scale back down to three servers and then remove the record of the temporarily added server.
$ kubectl scale --replicas=3 statefulset/nats-helm-kind
Remove Peer
$ nats server cluster peer-remove -f G7oD67bf
Success!
The cluster should now be restored to a healthy state, with three servers.
About The Author
John Weldon is a Customer Solutions Architect at Synadia Communications.
[1] The name of the helm release in this article is nats-helm-kind; it could be anything, and the default is often just nats.
[2] You can replicate the kind environment I used in this article by referring to this repository.
[3] NATS servers are also called “nodes”, sometimes interchangeably.
[4] Throughout this article I use nats-box to execute nats commands; the simple way to do it from the command line is:
kubectl exec -it deployment/nats-helm-kind-box -- nats <command> <args>
nats-box is deployed by default in the nats helm chart. For simplicity, I’ll just show the plain NATS command in the examples.