Why should a Kubernetes control plane be three nodes?

Why does Sidero Labs recommend 3 nodes for the control plane, in most cases? Why not 4? Or 10? Read on to find out!

A Kubernetes cluster generally consists of two classes of nodes: workers, which run applications, and control plane nodes, which control the cluster — scheduling jobs on the workers, creating new replicas of pods when the load requires it, etc.

The control plane runs the components that allow the cluster to offer high availability, recover from worker node failures, respond to increased demand for a pod, etc. Without a fully functioning control plane, a cluster cannot make any changes to its current state, meaning no new pods can be scheduled. As such, ensuring high availability for the control plane is critical in production environments.

(Technically, you can schedule pods on the control plane nodes, so you don’t need two classes of nodes, but this is not recommended for production clusters — you do not want workload demands to take resources away from the components that keep Kubernetes highly available. Keeping workloads off the control plane also eliminates the chance that a vulnerability lets a workload access the secrets of the control plane, which would give full access to the cluster.)

A managed service may deal with the control plane for you, but if you are running Kubernetes on bare metal, or on a cloud provider while managing Kubernetes yourself for better security or to avoid cloud lock-in, you need to ensure your control plane is always up. So how do you ensure the control plane is highly available? Kubernetes achieves high availability by replicating the control plane functions across multiple nodes. But how many nodes should you use?

May the Odds Be Ever in Your Favor

It’s not quite as simple as “more is better.” One of the functions of the control plane is to provide the datastore used for the configuration of Kubernetes itself. This information is stored as key-value pairs in the etcd database. Etcd uses a quorum system, requiring that more than half the replicas be available before committing any updates to the database. So a two-node control plane requires not just one node to be available, but both nodes (as the integer that is “more than half” of 2 is… 2). That is, going from a single-node control plane to a 2-node control plane makes availability worse, not better.

In the case of a 2-node control plane, when one node can’t reach the other, it doesn’t know whether the other node is dead (in which case it may be OK for the surviving node to keep applying updates to the database) or just unreachable. If both nodes are up but can’t reach each other, and both continue to perform writes, you end up with a split-brain situation, where the two halves of the cluster have inconsistent copies of data, with no way to reconcile them. Thus the safer option is to lock the database and prevent further writes by either node. And because the chance of one node dying is higher with 2 nodes than with one (twice as high, in fact, assuming they are identical nodes), the reliability of a 2-node control plane is worse than that of a single node!

This same logic applies as the control plane nodes scale — etcd will always demand that more than half of the nodes be alive and reachable in order to have a quorum, so the database can perform updates.

Thus a 2-node control plane requires both nodes to be up. A 3-node control plane also requires that 2 nodes be up. A 4-node control plane requires 3 nodes to be up. So a 4-node control plane has worse availability than a 3-node control plane: while both can suffer a single node outage, and neither can survive a 2-node outage, the odds of a 2-node outage are higher with 4 nodes than with 3. Adding a node to an odd-sized cluster appears better (there are more machines), but availability is actually worse: exactly the same number of nodes may fail without losing quorum, yet there are more nodes that can fail.
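
To make the arithmetic concrete: etcd’s quorum for a cluster of n members is floor(n/2) + 1, and the number of members that can fail without losing quorum is n minus that quorum. A quick shell sketch (nothing Kubernetes-specific is assumed) prints both for a range of cluster sizes:

# quorum = floor(n/2) + 1; fault tolerance = n - quorum
for n in 1 2 3 4 5 6 7 8 9; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated-failures=$(( n - quorum ))"
done

Running it shows that a 4-node cluster has a quorum of 3 and tolerates only one failure: exactly the same tolerance as a 3-node cluster, but with more machines that can fail.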

Thus a general rule is to always run an odd number of nodes in your control plane.

Too Much of a Good Thing?

So we want an odd number of nodes. Does that mean 3 nodes, or 5, or 23?

The etcd docs say “…an etcd cluster probably should have no more than seven nodes…. A 5-member etcd cluster can tolerate two-member failures, which is enough in most cases. Although larger clusters provide better fault tolerance, the write performance suffers because data must be replicated across more machines.”

The Kubernetes docs say “It is highly recommended to always run a static five-member etcd cluster for production Kubernetes clusters at any officially supported scale. A reasonable scaling is to upgrade a three-member cluster to a five-member one, when more reliability is desired.”

So these documents both imply that scaling the etcd cluster is only for fault tolerance, not performance, and indeed that adding members will reduce performance.

Let’s see if this is true. To test this, I created Kubernetes clusters with control planes of various sizes, ranging from three nodes to nine nodes, on two different sizes of machines. In all cases, I used a “stacked” etcd topology, where the etcd service runs on each control plane node (as opposed to an external etcd topology, with separate servers for the Kubernetes control plane and the etcd cluster).

All tests were done using Talos Linux v1.2.0-alpha.1, which installs Kubernetes v1.24.2 and etcd v3.5.4, with the benchmark client running on an AWS c3.4xlarge instance.

Results

The tests were taken from the Performance documentation of etcd and run using the etcd benchmark tool:

Write to leader:

go run ./benchmark --endpoints=$ENDPOINT --target-leader --conns=100 --clients=1000 put --key-size=8 --sequential-keys --total=100000 --val-size=256 --cert ./certs/admin.crt --key ./certs/admin.key --cacert ./certs/ca.crt

Write to all members:

go run ./benchmark --endpoints=$ENDPOINTS --conns=100 --clients=1000 put --key-size=8 --sequential-keys --total=100000 --val-size=256 --cert ./certs/admin.crt --key ./certs/admin.key --cacert ./certs/ca.crt

Linearized concurrent read requests:

go run ./benchmark --endpoints=$ENDPOINTS --conns=100 --clients=1000 range YOUR_KEY --consistency=l --total=100000 --cert ./certs/admin.crt --key ./certs/admin.key --cacert ./certs/ca.crt

Read requests with sequential consistency only (served by any etcd member, instead of a quorum of members, in exchange for possibly serving stale data):

go run ./benchmark --endpoints=$ENDPOINTS --conns=100 --clients=1000 range YOUR_KEY --consistency=s --total=100000 --cert ./certs/admin.crt --key ./certs/admin.key --cacert ./certs/ca.crt
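
For reference, $ENDPOINT and $ENDPOINTS in the commands above hold the etcd client URL of a single member and the comma-separated client URLs of all members, respectively. A sketch of how they might be set for a 3-node control plane (the IP addresses are placeholders; 2379 is etcd’s standard client port):

# Placeholder control plane IPs - substitute your own
ENDPOINT=https://172.31.0.10:2379
ENDPOINTS=https://172.31.0.10:2379,https://172.31.0.11:2379,https://172.31.0.12:2379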

Each figure shows the operations per second for the given control plane size (3, 5, 7 or 9 nodes), averaged over 3 consecutive runs of each test:

Node type  | Test                                           | 3 nodes | 5 nodes | 7 nodes | 9 nodes
m4.large   | Write to leader                                |    2200 |    2069 |    1999 |    1928
m4.large   | Write to all members                           |    5815 |    7396 |    7778 |    7406
m4.large   | Linearized concurrent read requests            |    9100 |   15540 |   20655 |   25646
m4.large   | Read requests with sequential consistency only |   13312 |   22756 |   31305 |   40723
m4.2xlarge | Write to leader                                |   12680 |   12181 |   11586 |   11013
m4.2xlarge | Write to all members                           |   14757 |   25419 |   23731 |   21789
m4.2xlarge | Linearized concurrent read requests            |   38409 |   54918 |   73879 |   81548
m4.2xlarge | Read requests with sequential consistency only |   59653 |   95124 |  115605 |  122488

All operations except writes to the leader increase in performance as the cluster scales. However, writes to the leader are the most critical operations, as they must be committed to ensure the integrity of the cluster. The performance of such writes degrades as the cluster scales, because each write has to be committed to more and more members.

Going from a 3-node etcd cluster to a 5-node cluster reduces write performance by 4 to 5%; a 7-node cluster takes off another 4%, and going to a 9-node cluster takes off another 4 to 5%, for a total of about 13% lower throughput than a 3-node cluster.
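
As a worked example, the per-step and overall drops can be computed directly from the m4.2xlarge “Write to leader” row of the table above (a simple awk sketch; only the table’s own numbers are used):

# Write-to-leader ops/sec for 3, 5, 7 and 9 node control planes (m4.2xlarge)
awk 'BEGIN {
  n[3]=12680; n[5]=12181; n[7]=11586; n[9]=11013
  prev=3
  for (s=5; s<=9; s+=2) {
    printf "%d -> %d nodes: %.1f%% slower\n", prev, s, (1 - n[s]/n[prev]) * 100
    prev = s
  }
  printf "3 -> 9 nodes overall: %.1f%% slower\n", (1 - n[9]/n[3]) * 100
}'

On this machine type the individual steps work out to roughly 3.9%, 4.9% and 4.9%, and the overall drop to about 13.1%.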

So, for a given size of control plane machines, a 3-node cluster will give the best performance, but can only tolerate one node failing. This is adequate for most environments, assuming you have good monitoring and processes in place to deal with a failed node in a timely manner. But if your application needs to be very highly available and tolerate 2 simultaneous control plane node failures, a 5-node cluster incurs only about a 5% performance penalty.

How about Auto-Scaling?

It should be apparent by now that autoscaling of etcd clusters, in the sense of adding more nodes in response to high CPU load, is a Bad Thing. As we saw from the benchmarks, adding more nodes to the cluster will decrease performance, due to the need to synchronize updates across more members. Additionally, autoscaling also exposes you to situations where you will likely be running with an even number of cluster members, at least transiently as the scaling operations happen, thus increasing the chance of node failures impacting the availability of etcd.

Indeed, the official Kubernetes documentation explicitly says:

“A general rule is not to scale up or down etcd clusters. Do not configure any auto-scaling groups for etcd clusters. It is highly recommended to always run a static five-member etcd cluster for production Kubernetes clusters at any officially supported scale.”

It is reasonable to use an autoscaling group to recover from the failure of a control plane node, or to replace control plane nodes with ones with more CPU power (this is what Amazon Web Services’ EKS means when talking about control plane autoscaling). However, there are some niceties that need to be observed when replacing control plane members, even failed ones — it is NOT as simple as just adding a new node!

In Case of Emergency, Break Glass, Remove Failed Node, Then Add New Node

Superficially, adding a new node and then removing the failed node seems the same as removing the failed node then adding a new one. However, the risks are greater in the former case.

To see why this is so, consider a simple 3-node cluster. A 3-node cluster will have a quorum of 2. If one node fails, the etcd cluster can keep working with its remaining two nodes. However, if you now add a new node to the cluster, quorum will be increased to 3, as the cluster is now a 4-node cluster, counting the down node, and we need more than half available to prevent split brain.

If the new member was misconfigured, and cannot join the cluster, you now have two failed nodes, and the cluster will be down and not recoverable: there will only be two nodes up, and a required quorum of 3.

Compare this to removing the failed node first. Once we remove the failed node, we now have a 2-node cluster, with a 2-node quorum, and two nodes up (so we cannot tolerate any further failures, but we can operate fine in this state.) If we now add a node, creating a 3-node cluster, quorum remains at 2. If the new node fails to join, we still have 2 nodes up in our 3-node cluster, and can remove and re-add the new node again.

The key takeaway is that when an etcd member node fails, remove the failed node from etcd before attempting to replace it with a new node.

The process for enacting this is documented in the Kubernetes documentation here. However, if you are running Talos Linux, which is specifically designed as an operating system for Kubernetes, the process is significantly simpler. Talos Linux has helper functions that automate the removal of down etcd nodes:

talosctl etcd remove-member ip-172-31-41-76
kubectl delete node ip-172-31-41-76
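
Before booting the replacement, it is worth confirming that the failed member really is gone from both etcd and Kubernetes. One way to check, run against one of the surviving control plane nodes (the IP address here is a placeholder):

# List the remaining etcd members via a surviving control plane node
talosctl -n 172.31.41.10 etcd members

# Confirm the failed node no longer appears in Kubernetes
kubectl get nodes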

You can then add in a new control plane node, which, with Talos Linux, is as simple as booting a new node with the controlplane.yaml that was used to create the other control plane nodes.

It should also be noted that Talos Linux uses the learner feature of etcd — all new control plane nodes join etcd as non-voting learners until they have caught up with all transactions. This means that adding an extra node does NOT increase quorum until the node is a reliable member, at which point it is automatically promoted to a voting member. This added safety does not change the advice to remove failed nodes before adding new ones, though.

If Replacing a Working Node, Add the New Node First, Then Remove the Node to Be Replaced

To upgrade a still-functioning control plane node and replace it with a machine with more CPU or memory, the order of actions is the opposite compared to when the node has failed — add the new node first, then remove the old.

To understand why, consider the case of a 3-node etcd cluster, where you wish to upgrade a node to faster hardware. The quorum will require 2 nodes up in order for the cluster to continue processing writes.

One approach would be to remove the node to be replaced first. This will leave a 2-node cluster, with a quorum of 2. You then add the new node. This returns the cluster to a 3-node cluster with a quorum of 2, but during the transition, there is no fault tolerance — an inopportunely timed failure could take down the cluster.

Another approach is to add the new node first, creating a 4-node cluster, which would have a quorum of 3, then remove one node. This returns the cluster to a 3-node cluster with a quorum of 2, but at all times the cluster can tolerate the failure of a node.

Thus it is safer to add the new node before removing the node to be replaced, if it is operational.

The procedure to perform this is documented on the Kubernetes site here. Talos Linux again makes this easier:

  • Add the new control plane node by booting it with the controlplane.yaml file.
  • Tell the node that is being replaced to leave the cluster:
    talosctl -n 172.31.138.87 reset
  • Remove the old node from Kubernetes with kubectl delete node, specifying the name of the node being replaced.

“talosctl reset” causes the node to be erased. Talos is aware when a node is a control plane node, and if so, it will gracefully leave etcd when reset.
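
Before the reset step, it is also worth verifying that the new node has fully joined: check that it appears in the etcd member list and shows as Ready in Kubernetes before removing the old node. A quick sketch, using a placeholder IP for one of the existing control plane nodes:

# Verify the new member is listed in etcd
talosctl -n 172.31.138.90 etcd members

# Verify the new node is Ready before resetting the old one
kubectl get nodes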

Summary

To summarize Kubernetes control plane sizing and management:

  • Run your clusters with three or five control plane nodes. Three is enough for most use cases. Five will give you better availability, but will cost you more, both in the number of nodes required and because each node may need more hardware resources to offset the performance degradation seen in larger clusters.
  • Implement good monitoring and put processes in place to deal with a failed node in a timely manner (and test them!)
  • Even with robust monitoring and procedures for replacing failed nodes in place, back up etcd and your control plane node configuration to guard against unforeseen disasters.
  • Monitor the performance of your etcd clusters. If etcd performance is slow, vertically scale the nodes, not the number of nodes.
  • If a control plane node fails, remove it first, then add the replacement node.
  • If replacing a node that has not failed, add the new one, then remove the old.

This article first appeared on The New Stack.
