Recovering Quorum After Renaming Servers

John Weldon — January 20, 2024

Occasionally, a NATS cluster can lose quorum for various reasons. Here, we’ll look at one specific case, and how to recover from it.

Note on nats-server 2.12+: This recovery procedure was the sanctioned path on nats-server 2.10 and 2.11. As of 2.12.0 (PR #7038), the scale-up recovery path no longer works: a fresh empty node can no longer force itself into the peer set, so the meta election never passes. If you are running 2.12 or later, see Recovery on 2.12 and Later below.

Context

How To Rename Servers in a Cluster

The recommended way to rename NATS servers in a cluster is to rename them one at a time. After each rename, the cluster will have a record of both the old name and the new name. The former will appear offline, and the latter should appear online. Remove the record of the old name before renaming the next server. Otherwise, the cluster will eventually accumulate too many servers that appear offline, and it will consider itself to have lost quorum.
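To make the failure mode concrete, here is a minimal sketch in plain shell arithmetic. It is a toy model, not NATS code: it assumes quorum is a simple majority of the recorded peer set (which matches the 6-node, 4-needed figures shown later in this article), and the function names are made up for illustration.

```shell
#!/bin/sh
# Toy model (not NATS code): rename servers one at a time WITHOUT removing
# the stale record of the old name, and watch the quorum requirement grow.

quorum() {
  # Assumed rule: simple majority of the recorded peer set, floor(n / 2) + 1.
  echo $(( $1 / 2 + 1 ))
}

simulate_renames() {
  online=$1      # servers actually running; a rename does not change this
  recorded=$1    # peers the meta group has on record
  i=1
  while [ "$i" -le "$online" ]; do
    recorded=$(( recorded + 1 ))   # the new name is added, the old one lingers
    need=$(quorum "$recorded")
    if [ "$online" -ge "$need" ]; then state=ok; else state=LOST; fi
    echo "rename $i: recorded=$recorded online=$online quorum=$need $state"
    i=$(( i + 1 ))
  done
}

simulate_renames 3
```

For a three-server cluster, the first two renames still leave a majority online, and the third one tips the recorded peer set to six, at which point three online servers are no longer enough.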

NATS Helm Chart Caveats - A Brief Diversion

The values.yaml file for the NATS Helm Chart has an option to set a serverNamePrefix , which you might be tempted to use to rename the servers in a helm chart deployed cluster.

This setting should only be changed before the first installation. Once the cluster is up and running, if you change this value, and then upgrade the helm release, you’ll cause all of the servers in the cluster to be simultaneously renamed. This will double the number of recorded servers in the cluster (half with the old name, and half with the new name, per the changed serverNamePrefix). Consequently, there will not be enough servers active in the cluster to retain quorum.

On nats-server 2.12 and later, reverting the serverNamePrefix change does not unbrick the cluster on its own. The original-named pods have to actually come back online – empty disks are fine, but the names must match – for the meta election to pass. See Recovery on 2.12 and Later below.

A Case Study

For the sake of this article, we’ll assume a helm chart¹ deployed cluster of three servers.²

We start by “breaking” this cluster: we modify the serverNamePrefix to rename the servers, and then upgrade the release.

First Indication of Trouble

Logs

The first indication of trouble is this pair of warning (WRN) and informational (INF) messages in the logs:

[WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[INF] JetStream cluster no metadata leader
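If you want to check for these messages from a script, a small filter is enough. The sketch below is a hypothetical helper (the function name is invented here); it only matches the two message texts quoted above and makes no assumption about the rest of the log format.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the log text on stdin contains either of
# the two "no meta leader" messages quoted above, non-zero otherwise.
no_meta_leader() {
  grep -q -e 'JetStream has not established contact with a meta leader' \
          -e 'JetStream cluster no metadata leader'
}

# Example usage. The pod and container names are assumptions based on the
# cluster used in this article; adjust them to your deployment:
#   kubectl logs x-nats-helm-kind-0 -c nats | no_meta_leader \
#     && echo "meta leader missing"
```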

Events

Another indication is that the NATS pods fail to progress to a ready state; the NATS container specifically shows that it’s running, but the readiness is false.

The pod events will show a warning with the message:

Readiness probe failed: HTTP probe failed with statuscode: 503

NATS CLI

With the NATS CLI, running nats server report jetstream will also show the problem. Depending on the cluster state, you will see one of the following:

Before Quorum is Lost

At first, you’ll just see half of the recorded servers (the ones with the old names) as offline:

$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                 │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server              │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                     │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                        RAFT Meta Group Information                                        │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name                                                │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │        │ false   │ false  │ 0s     │ 0   │
│ nats-helm-kind-0                                    │ YMpQSy04 │        │ false   │ false  │ 19.53s │ 1   │
│ nats-helm-kind-1                                    │ MGRogjE4 │        │ false   │ false  │ 0s     │ 13  │
│ x-nats-helm-kind-0                                  │ svvjmHnE │        │ true    │ true   │ 526ms  │ 0   │
│ x-nats-helm-kind-1                                  │ XCzEfWSa │        │ true    │ true   │ 525ms  │ 0   │
│ x-nats-helm-kind-2                                  │ XGX0cX6V │ yes    │ true    │ true   │ 0s     │ 0   │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

After Quorum is Lost

After quorum is lost, but before the failing readiness probes cause the servers to stop responding, there is a brief window in which the jetstream report shows this warning:

$ nats server report jetstream
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                │
├────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server             │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                    │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯


WARNING: No cluster meta leader found. The cluster expects 6 nodes but only 3 responded. JetStream operation require at least 4 up nodes.

After Servers Stop Responding

Finally, the servers may stop responding altogether, and you’ll get a generic connection error:

$ nats server report jetstream
nats: error: nats: no servers available for connection
command terminated with exit code 1

Recovery on 2.12 and Later

On 2.12 and later, the scale-up recovery path no longer works. A fresh empty node can no longer force itself into the peer set on first contact (see PR #7038), so adding a new node to a cluster that has lost quorum will not trigger a meta election.

The safe recovery path is to bring the renamed (or downed) nodes back under their original names. The cluster recognizes peers by name; restoring the original names is what re-establishes quorum.

Revert the Rename

For the canonical case in this article (a bulk rename caused by changing serverNamePrefix), revert the change in values.yaml and upgrade the helm release:

$ helm upgrade <release-name> nats/nats -f values.yaml

A helm upgrade updates the StatefulSet template in place without deleting pods, so the existing PVCs stay attached and the originally-named pods come back with their original JetStream data intact. Once the originally-named nodes are online, the meta election succeeds and quorum is regained.

Verify with:

$ nats server report jetstream

Once a leader is elected, clean up any stale peer entries left over from the failure (see Clean Up Stale Peers below).

When the Original Disks Are Wiped

If the PVCs were deleted or the storage was destroyed but the hostnames can be reused, bring the originally-named nodes back with empty disks. On 2.12+, a peer with an empty disk uses an “empty vote” (introduced by PR #7038) that only counts when all peers in the original set vote, so every originally-named node has to be available at the same time for the leader election to pass. If even one original peer cannot come back, this path is also blocked.
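The rule above can be expressed as a small predicate. This is a toy model of the sentence as written, not the server’s actual election logic; the function name, its arguments, and the simple-majority threshold are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of the "empty vote" rule described above, NOT the server's code.
# Arguments:
#   $1  size of the original peer set
#   $2  original-named peers online with their data intact
#   $3  original-named peers online with empty disks
meta_election_passes() {
  total=$1; with_data=$2; empty=$3
  if [ $(( with_data + empty )) -eq "$total" ]; then
    votes=$(( with_data + empty ))   # every peer voted, so empty votes count
  else
    votes=$with_data                 # otherwise empty votes are ignored
  fi
  # Assumed threshold: simple majority of the original peer set.
  [ "$votes" -ge $(( total / 2 + 1 )) ]
}

# Example: three wiped nodes all back at once pass; two of three do not.
#   meta_election_passes 3 0 3 && echo "election can pass"
#   meta_election_passes 3 0 2 || echo "still blocked"
```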

Clean Up Stale Peers

After quorum is restored, the meta group may still list peer entries left over from the failure (for example, the new-prefix server names that were added before the rename was reverted). Remove them with peer-remove, using the peer ID from nats server report jetstream:

$ nats server cluster peer-remove -f <peer ID>
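When several stale entries are left over, you may want to list their peer IDs in one go. The following is a rough sketch that scrapes the human-readable table printed by nats server report jetstream, as shown earlier in this article. The helper name is hypothetical, and it assumes the RAFT Meta Group table keeps the column order Name, ID, Leader, Current, Online, Active, Lag. Scraping table output is fragile, so review the list by hand before removing anything.

```shell
#!/bin/sh
# Hypothetical helper: reads `nats server report jetstream` output on stdin
# and prints the ID column of every RAFT meta group row whose Online column
# is "false". Assumes the column layout shown in this article.
offline_peer_ids() {
  awk -F'│' '
    NF >= 8 {
      id = $3; online = $6
      gsub(/^[ \t]+|[ \t]+$/, "", id)       # trim padding around the ID
      gsub(/^[ \t]+|[ \t]+$/, "", online)   # trim padding around Online
      if (online == "false") print id
    }'
}

# Example usage: print the candidates first, then remove each one by hand.
#   nats server report jetstream | offline_peer_ids
```

An offline peer is not necessarily a stale one: a legitimately named server that is only temporarily down also shows Online as false, which is why the sketch only prints IDs rather than calling peer-remove itself.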

When Original Names Cannot Be Reclaimed

If the original hostnames cannot be brought back at all (machines truly destroyed, names unrecoverable), there is no in-product recovery path on 2.12 and later at the time of writing. Restore from backup.

Recovery on 2.10 and 2.11

The procedure below relies on the pre-2.12 behavior where a fresh empty node could force itself into the peer set on first contact; that path is closed on 2.12 and later (see the note at the top and Recovery on 2.12 and Later).

Regain Quorum

To recover, the cluster must first regain quorum. In this case, the cluster thinks that there are six nodes³ in the cluster, so to regain quorum there needs to be a minimum of four nodes reachable from each other.
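The arithmetic can be sketched as follows. This is toy shell arithmetic rather than anything from NATS itself: it assumes quorum is a simple majority of the recorded peer set, and that each server added on the pre-2.12 path increases both the recorded peer count and the online count by one. The function name is invented for this example.

```shell
#!/bin/sh
# Toy arithmetic, not NATS code. Given the number of recorded peers and the
# number currently online, print how many fresh servers must be added before
# the online servers form a majority of the (now larger) peer set.
servers_to_add() {
  recorded=$1; online=$2; added=0
  while [ $(( online + added )) -lt $(( (recorded + added) / 2 + 1 )) ]; do
    added=$(( added + 1 ))   # one more server: +1 recorded, +1 online
  done
  echo "$added"
}

servers_to_add 6 3   # prints 1
```

With six recorded peers and three online, one extra server is enough: the peer set grows to seven, a majority of seven is four, and four servers are then online.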

The way to do this is to add one more server, which will allow quorum to be regained. You can do this by scaling the stateful set:

$ kubectl scale --replicas=4 statefulset/nats-helm-kind

Once the new pod is running, quorum should be regained:

$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                 │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server              │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-3  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                     │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                        RAFT Meta Group Information                                        │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name                                                │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │        │ false   │ false  │ 0s     │ 0   │
│ nats-helm-kind-0                                    │ YMpQSy04 │        │ false   │ false  │ 47m27s │ 6   │
│ nats-helm-kind-1                                    │ MGRogjE4 │        │ false   │ false  │ 0s     │ 18  │
│ x-nats-helm-kind-0                                  │ svvjmHnE │        │ true    │ true   │ 461ms  │ 0   │
│ x-nats-helm-kind-1                                  │ XCzEfWSa │        │ true    │ true   │ 461ms  │ 0   │
│ x-nats-helm-kind-2                                  │ XGX0cX6V │ yes    │ true    │ true   │ 0s     │ 0   │
│ x-nats-helm-kind-3                                  │ G7oD67bf │        │ true    │ true   │ 461ms  │ 0   │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

Remove Offline/Old Servers

Now you can begin cleaning up the old server records. You can do this either with the CLI or by using NATS directly.

CLI

Using the CLI⁴:

$ nats server cluster peer-remove -f <peer ID>

Using NATS Directly

You can also remove a peer directly by publishing to the JetStream API subjects:

$ nats publish '$JS.API.SERVER.REMOVE' '{"peer":"","peer_id":"YMpQSy04"}'

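The server answers on the request’s reply subject with a message confirming the action. Note that nats publish does not wait for a reply, so send the same subject and payload with nats request if you want to see it: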

{
  "type": "io.nats.jetstream.api.v1.meta_server_remove_response",
  "success": true
}
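If you script the removal, you may want to check the response rather than eyeball it. The sketch below is a hypothetical helper that looks for "success": true in the response JSON shown above using grep, so it does not depend on jq being installed; it is a text match, not a real JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the JSON on stdin contains "success": true,
# as in the meta_server_remove_response shown above; non-zero otherwise.
removal_succeeded() {
  grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# Example usage, assuming the response has been saved to response.json:
#   removal_succeeded < response.json && echo "peer removed"
```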

Remove Temporarily Added Server

Scale Back Down to Three Servers

Now that the cluster has four servers on record instead of six, it’s safe to scale back down to three servers, and then remove the record of the server we temporarily added.

$ kubectl scale --replicas=3 statefulset/nats-helm-kind

Remove Peer

$ nats server cluster peer-remove -f G7oD67bf

Success!

The cluster should now be restored to a healthy state, with three servers.

About The Author

John Weldon is a Customer Solutions Architect at Synadia Communications.

¹ The name of the helm release in this article is nats-helm-kind. It could be anything; often the default is just nats.
² You can replicate the kind environment I used in this article by referring to this repository.
³ NATS servers are also called “nodes”, sometimes interchangeably.
⁴ Throughout this article I use nats-box to execute nats commands. The simple way to do it from the command line is: kubectl exec -it deployment/nats-helm-kind-box -- nats <command> <args>. nats-box is deployed by default in the nats helm chart. For simplicity, I’ll just show the plain nats command in the examples.
