Recovering Quorum After Renaming Servers

John Weldon — January 20, 2024

Occasionally, a NATS cluster can lose quorum for various reasons. Here, we’ll look at one specific case, and how to recover from it.

Context

How To Rename Servers in a Cluster

The recommended way to rename NATS servers in a cluster is to rename them one at a time. After each rename, the cluster will have a record of both the old name and the new name: the former will appear offline, and the latter should appear online. Remove the record of the old name before renaming the next server. Otherwise, the cluster may eventually accumulate too many apparently offline servers and consider itself to have lost quorum.

NATS Helm Chart Caveats - A Brief Diversion

The values.yaml file for the NATS Helm Chart has an option to set a serverNamePrefix, which you might be tempted to use to rename the servers in a Helm-chart-deployed cluster.

This setting should only be changed before the first installation. If you change this value once the cluster is up and running, and then upgrade the helm release, all of the servers in the cluster are renamed simultaneously. This doubles the number of recorded servers in the cluster (half with the old names and half with the new names, per the changed serverNamePrefix). Only the newly named half is actually running, which is short of the majority the cluster needs, so it cannot retain quorum.

A Case Study

For the sake of this article, we’ll assume a Helm-chart-deployed cluster [1] of three servers [2].

We start by “breaking” this cluster: we change the serverNamePrefix to rename the servers, then upgrade the release.

First Indication of Trouble

Logs

The first indication of trouble is this pair of WRN and INF messages in the logs:

[WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[INF] JetStream cluster no metadata leader
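
On a busy server these two lines can be easy to miss. As a minimal sketch, a small filter can pick them out of a log stream; meta_leader_warnings is a hypothetical helper name, and the pod and container names in the usage comment are assumptions based on the release used in this article.

```shell
# Print only the log lines that indicate a missing JetStream meta leader.
# Reads log text on stdin; the two patterns are the messages quoted above.
meta_leader_warnings() {
  grep -E 'not established contact with a meta leader|no metadata leader'
}

# Example usage (assumes a pod named nats-helm-kind-0 with a container named nats):
#   kubectl logs nats-helm-kind-0 -c nats | meta_leader_warnings
```

The function prints nothing and returns a non-zero status when neither message is present, so it can also serve as a quick check in a script.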

Events

Another indication is that the NATS pods fail to progress to a ready state; the NATS container specifically shows that it’s running, but the readiness is false.

The events will show a warning with the message:

Readiness probe failed: HTTP probe failed with statuscode: 503

NATS CLI

Using the NATS CLI, running nats server report jetstream will also show the problem; depending on the cluster state, you could see any one of the following:

Before Quorum is Lost

At first, you’ll just see half the servers (with the old name) as being offline:

$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                 │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server              │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                     │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                        RAFT Meta Group Information                                        │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name                                                │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │        │ false   │ false  │ 0s     │ 0   │
│ nats-helm-kind-0                                    │ YMpQSy04 │        │ false   │ false  │ 19.53s │ 1   │
│ nats-helm-kind-1                                    │ MGRogjE4 │        │ false   │ false  │ 0s     │ 13  │
│ x-nats-helm-kind-0                                  │ svvjmHnE │        │ true    │ true   │ 526ms  │ 0   │
│ x-nats-helm-kind-1                                  │ XCzEfWSa │        │ true    │ true   │ 525ms  │ 0   │
│ x-nats-helm-kind-2                                  │ XGX0cX6V │ yes    │ true    │ true   │ 0s     │ 0   │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

After Quorum is Lost

After quorum is lost, but before the failing readiness probes cause the servers to stop responding, there is a brief window in which the JetStream report shows this warning:

$ nats server report jetstream
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                │
├────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server             │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2 │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                    │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯


WARNING: No cluster meta leader found. The cluster expects 6 nodes but only 3 responded. JetStream operation require at least 4 up nodes.

After Servers Stop Responding

Finally, the servers may stop responding altogether, giving you this general error:

$ nats server report jetstream
nats: error: nats: no servers available for connection
command terminated with exit code 1

How To Recover

Regain Quorum

To recover, the cluster must first regain quorum. In this case, the cluster thinks that it has six nodes [3], so regaining quorum requires at least four nodes that can reach each other.
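
The arithmetic behind that minimum is the usual majority rule: more than half of the known peers must be reachable. A minimal sketch in shell, where quorum_size and servers_to_add are hypothetical helper names used only for illustration:

```shell
# Majority quorum for a group with the given number of known peers.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many more servers must come online to reach quorum (0 if already met).
# $1 = peers the cluster knows about, $2 = peers currently online.
servers_to_add() {
  known=$1
  online=$2
  needed=$(( $(quorum_size "$known") - online ))
  [ "$needed" -gt 0 ] || needed=0
  echo "$needed"
}

quorum_size 6        # prints 4
servers_to_add 6 3   # prints 1
```

A newly added server also becomes a known peer, which raises the count from six to seven, but the majority of seven is still four (quorum_size 7 prints 4), so one extra server is enough.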

The way to do this is to add one more server, which will allow quorum to be regained. You can do this by scaling the stateful set:

$ kubectl scale --replicas=4 statefulset/nats-helm-kind

Once the new pod is ready, quorum should be regained:

$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                 │
├─────────────────────┬────────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server              │ Cluster        │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ x-nats-helm-kind-0  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-1  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-2* │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
│ x-nats-helm-kind-3  │ nats-helm-kind │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├─────────────────────┼────────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                     │                │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
╰─────────────────────┴────────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                        RAFT Meta Group Information                                        │
├─────────────────────────────────────────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name                                                │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────────────────────────────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ Server name unknown at this time (peerID: Wp0X92Zu) │ Wp0X92Zu │        │ false   │ false  │ 0s     │ 0   │
│ nats-helm-kind-0                                    │ YMpQSy04 │        │ false   │ false  │ 47m27s │ 6   │
│ nats-helm-kind-1                                    │ MGRogjE4 │        │ false   │ false  │ 0s     │ 18  │
│ x-nats-helm-kind-0                                  │ svvjmHnE │        │ true    │ true   │ 461ms  │ 0   │
│ x-nats-helm-kind-1                                  │ XCzEfWSa │        │ true    │ true   │ 461ms  │ 0   │
│ x-nats-helm-kind-2                                  │ XGX0cX6V │ yes    │ true    │ true   │ 0s     │ 0   │
│ x-nats-helm-kind-3                                  │ G7oD67bf │        │ true    │ true   │ 461ms  │ 0   │
╰─────────────────────────────────────────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

Remove Offline/Old Servers

Now you can begin cleaning up the old server records. You can do this either with the CLI or by using NATS directly.
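
The records to remove are the rows of the RAFT Meta Group table whose Online column reads false, and the value to pass along is the one in the ID column. As a rough sketch, a shell function could pull those IDs out of the report text. offline_peer_ids is a hypothetical name, not part of the NATS CLI, and the parsing assumes the seven-column table layout shown in this article, which may differ between CLI versions.

```shell
# Print the peer IDs of meta group members reported as offline.
# Reads the text of "nats server report jetstream" on stdin.
offline_peer_ids() {
  awk -F'│' '
    # RAFT rows have 7 columns, so splitting on the border gives 9 fields;
    # the JetStream Summary rows have more and are skipped.
    NF == 9 {
      id = $3
      online = $6
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", online)
      if (online == "false") print id
    }'
}

# Example usage:
#   nats server report jetstream | offline_peer_ids
```

It is worth comparing the printed IDs against the table by eye before removing anything, since a server that is only briefly unreachable would also show up here.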

CLI

Using the CLI [4]:

$ nats server cluster peer-remove -f <peer ID>
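
With several stale records to clean up, the same command can be repeated for each peer ID. A minimal sketch, assuming the nats CLI is on the path (or wrapped as described in the footnotes); remove_peers is a hypothetical helper name:

```shell
# Remove each given peer ID from the meta group, one at a time.
# Stops at the first failure so the cluster state can be re-checked.
remove_peers() {
  for peer_id in "$@"; do
    nats server cluster peer-remove -f "$peer_id" || return 1
  done
}

# Example usage, with the three offline IDs from the report above:
#   remove_peers Wp0X92Zu YMpQSy04 MGRogjE4
```

Removing peers one by one, rather than in parallel, keeps each step easy to verify with another nats server report jetstream.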

Using NATS Directly

You can also remove a peer directly by publishing to the JetStream API subjects:

$ nats publish '$JS.API.SERVER.REMOVE' '{"peer":"","peer_id":"YMpQSy04"}'

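The server replies with a message confirming the action. The reply goes to the message’s reply subject, so sending it with nats request rather than nats publish is one way to see it: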

{
  "type": "io.nats.jetstream.api.v1.meta_server_remove_response",
  "success": true
}

Remove Temporarily Added Server

Scale Back Down to Three Servers

Now that the cluster has a record of only four servers, it’s safe to scale back down to three servers and then remove the record of the server we temporarily added.

$ kubectl scale --replicas=3 statefulset/nats-helm-kind

Remove Peer

$ nats server cluster peer-remove -f G7oD67bf

Success!

The cluster should now be restored to its original state, with three servers.

About The Author

John Weldon is a Customer Solutions Architect at Synadia Communications.


  1. The name of the helm release in this article is nats-helm-kind; it could be anything, and often the default is just nats.

  2. You can replicate the kind environment I used in this article by referring to this repository.

  3. NATS servers are also called “nodes”; the terms are sometimes used interchangeably.

  4. Throughout this article I use nats-box to execute nats commands; the simple way to do it from the command line is: kubectl exec -it deployment/nats-helm-kind-box -- nats <command> <args>

    nats-box is deployed by default in the nats helm chart.

    For simplicity, I’ll just show the plain NATS command in the examples.

