Announcing KubeSpan: What it is, how it works, and why you need it.

You are a cloud-native, savvy engineer. Kubernetes is your go-to for deploying and managing software. Yet sometimes it just feels so heavy, doesn’t it? If you want to deploy software at a new location, or on a different cloud, that usually means creating a whole new Kubernetes cluster, control plane and all. If this sounds like a pain – and it is! – we have the solution: KubeSpan.

The whole idea behind Kubernetes is to abstract away the infrastructure and focus on your applications. This works brilliantly for deployments in a single location, but when you want to span multiple sites or multiple networks, suddenly everything falls apart. Why should you care where your infrastructure is when the infrastructure itself is being abstracted away?

Even cloud providers do not solve this problem. Why can’t you span a Kubernetes cluster across regions? Some cloud providers don’t even let you span across availability zones within the same region.

Almost everyone wants a multi-location deployment for reasons like:

  • disaster recovery
  • performance and localization
  • mixing on-demand/high-cost with fixed/low-cost infrastructure
  • physical point of use requirements (cash registers, data collection hardware, physical storage)
  • avoiding vendor lock-in

So while there is a lot of demand for multi-cloud and hybrid Kubernetes, there has been no easy solution. Now, this isn’t to say that it can’t be done. Ultimately, the requirement for most CNI plugins is that each node has direct communication with every other node. This can be achieved in a multi-location scenario in a number of ways:

  • Full native IPv6 (good luck finding that universally supported… and make sure your firewalls are configured correctly!)
  • All nodes with direct public IPv4 addresses (just in case you have an unnatural number of IPv4 addresses and like exposing your nodes to the internet)
  • VPN solutions (difficult or expensive in the cloud, and they require extra tooling and often extra hardware)

Mostly, these require external coordination, which is difficult or expensive to achieve in many environments.

Ideally, each node should be able to communicate securely and directly with every other node, regardless of where it is and what network it is on. This sounds like a job for…

WireGuard

WireGuard seeks to provide secure transport between two endpoints. Equally importantly, WireGuard is efficient for full mesh systems. While traditional VPNs scale poorly to full mesh deployments due to their overhead (both administrative and operational), WireGuard uses a very simple and efficient mechanism for managing a large number of direct peers.

So if WireGuard is so great, why isn’t it already being used?

The main difficulty with WireGuard is key distribution and peer discovery. In a highly dynamic system like Kubernetes, where nodes should be treated as cattle and not pets, individually coordinating communication between them is cumbersome.
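
To make that concrete, here is roughly what statically configuring one node of a full mesh looks like, using the wgctrl Go library. The Peer type and surrounding plumbing are ours, for illustration only – the point is that every node must run something like this, and re-run it whenever any other node’s key or address changes:

```go
package main

import (
	"log"
	"net"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// Peer is the minimum a node must know about every other node:
// a public key and a reachable endpoint. Gathering and refreshing
// this list across N dynamic nodes is the coordination problem.
type Peer struct {
	PublicKey  wgtypes.Key
	Endpoint   *net.UDPAddr
	AllowedIPs []net.IPNet
}

// configureMesh installs the full peer list on a WireGuard device.
func configureMesh(device string, peers []Peer) error {
	client, err := wgctrl.New()
	if err != nil {
		return err
	}
	defer client.Close()

	keepalive := 25 * time.Second // keeps NAT mappings alive
	cfgs := make([]wgtypes.PeerConfig, 0, len(peers))
	for _, p := range peers {
		cfgs = append(cfgs, wgtypes.PeerConfig{
			PublicKey:                   p.PublicKey,
			Endpoint:                    p.Endpoint,
			AllowedIPs:                  p.AllowedIPs,
			ReplaceAllowedIPs:           true,
			PersistentKeepaliveInterval: &keepalive,
		})
	}

	return client.ConfigureDevice(device, wgtypes.Config{
		ReplacePeers: true,
		Peers:        cfgs,
	})
}

func main() {
	// The hard part is not this call – it is producing and refreshing
	// the peer list for every node, by hand, as the cluster changes.
	if err := configureMesh("wg0", nil /* peer list from… somewhere */); err != nil {
		log.Fatal(err)
	}
}
```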

KubeSpan

Sidero Labs builds software to automate and manage Kubernetes clusters on bare metal, on premises, and in the cloud. Our core product is Talos OS, the Kubernetes OS: an extremely lightweight, read-only, image-based Linux operating system highly optimized for Kubernetes. It has no shell, no SSH, and its operation is entirely defined by a static manifest and managed solely by API. If you are managing your own Kubernetes clusters, you want to be running those clusters on Talos OS.

We hear constantly about the need for hybrid and distributed Kubernetes deployments. Talos OS now enables automatic full mesh WireGuard deployments requiring no additional external tooling or configuration. Users need only set a single configuration flag, and all communication between their nodes will be transparently and automatically encrypted.
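
On current Talos releases, that flag lives in the machine configuration (KubeSpan shipped with Talos v0.13, and recent versions of talosctl gen config can also emit it via --with-kubespan). A minimal, illustrative patch:

```yaml
machine:
  network:
    kubespan:
      enabled: true   # encrypt node-to-node traffic over WireGuard
cluster:
  discovery:
    enabled: true     # let nodes find each other's keys and endpoints
```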

We call this system KubeSpan™.

KubeSpan delivers a solution to the coordination and key exchange problem, allowing every node to discover, and communicate over an encrypted channel with, every other node – even across NAT and firewalls. It supports roaming devices, and it transparently encrypts traffic destined for other members of the cluster while leaving all other traffic unencrypted.

KubeSpan discovery

A node in a KubeSpan-enabled Kubernetes cluster needs to know which other nodes are part of the cluster, and, to communicate securely with them, it needs to know:

  • the public key of the host to connect to
  • an IP address and port of the host to connect to

In our solution, we take a multi-tiered approach to gathering this information and keeping it up to date. Each tier can operate independently, but combining the tiers produces a more robust set of connection criteria (sketched in code below).

The initial release of KubeSpan supports two tiers:

  • an external discovery service
  • a Kubernetes-based system
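
Conceptually, every tier answers the same question – the two facts listed above, for each peer – and the answers are merged. A minimal sketch of that shape in Go, with type and interface names that are ours rather than Talos’s:

```go
package kubespan

import (
	"context"
	"net/netip"

	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// PeerInfo is what each tier must supply per peer: exactly the two
// items listed above.
type PeerInfo struct {
	PublicKey wgtypes.Key      // the peer's WireGuard public key
	Endpoints []netip.AddrPort // candidate IP:port pairs to try
}

// Tier is one independent source of peer information.
type Tier interface {
	Peers(ctx context.Context) (map[string]PeerInfo, error) // keyed by node identity
}

// Merge unions the candidates from every tier that is reachable.
func Merge(ctx context.Context, tiers ...Tier) map[string]PeerInfo {
	out := map[string]PeerInfo{}
	for _, t := range tiers {
		peers, err := t.Peers(ctx)
		if err != nil {
			continue // a tier may be down; the others still contribute
		}
		for id, p := range peers {
			merged := out[id]
			merged.PublicKey = p.PublicKey
			merged.Endpoints = append(merged.Endpoints, p.Endpoints...)
			out[id] = merged
		}
	}
	return out
}
```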

We maintain a public discovery service whereby members of your cluster can use a shared (but globally unique) key to coordinate the basic information needed to get the encrypted link up: the public key and the set of possible endpoints (IP:port pairs) for each node. Nodes register with the discovery service under the cluster’s shared key when they come up, and periodically thereafter, providing the service with their public key and their IP addresses and ports. They also obtain all the extant members of their cluster, along with those members’ keys and network connection information. In the simple case, this is enough for a full-mesh WireGuard network – but since when is networking simple?
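
A hypothetical client loop against such a service might look like the sketch below. The Discovery interface is purely illustrative (the real Talos protocol differs), and PeerInfo is the type from the previous sketch:

```go
package kubespan

import (
	"context"
	"time"
)

// Discovery is a hypothetical client for a discovery service.
type Discovery interface {
	// Register announces our public key and candidate endpoints
	// under the cluster's shared key.
	Register(ctx context.Context, clusterKey string, self PeerInfo) error
	// List returns every registered member's key and endpoints.
	List(ctx context.Context, clusterKey string) (map[string]PeerInfo, error)
}

// Announce keeps our registration fresh and hands the latest peer
// set to apply (which would reconfigure the WireGuard device).
func Announce(ctx context.Context, d Discovery, clusterKey string, self PeerInfo, apply func(map[string]PeerInfo)) error {
	ticker := time.NewTicker(30 * time.Second) // illustrative refresh interval
	defer ticker.Stop()
	for {
		if err := d.Register(ctx, clusterKey, self); err != nil {
			return err
		}
		peers, err := d.List(ctx, clusterKey)
		if err != nil {
			return err
		}
		apply(peers)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```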

NAT, multiple IPs and other complex networks

One of the difficulties in communicating across networks is that there is often no single address and port which identifies a connection for all nodes in the system. For instance, node A might see node B sitting on the same network as 192.168.2.10, but node C across the internet may see node B as 2001:db8:1ef1::10. KubeSpan gets around this difficulty by having each node report the set of peer keys and network connections that it is aware of. Thus node A would report to the discovery service that node B is at 192.168.2.10, while node C would report that node B is at 2001:db8:1ef1::10. When node D joins the KubeSpan cluster, it queries the discovery service and is given both addresses for node B. Talos OS with KubeSpan continuously discovers and rotates these IP:port pairs until a connection is established.
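
A simplified sketch of that rotation, again using the wgctrl library: point the peer at each candidate endpoint in turn and watch for a completed handshake. KubeSpan’s real logic is event-driven and tracks many peers at once; this only shows the idea:

```go
package kubespan

import (
	"fmt"
	"net"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// TryEndpoints cycles through a peer's candidate endpoints until a
// WireGuard handshake succeeds, then stops rotating.
func TryEndpoints(client *wgctrl.Client, device string, peer wgtypes.Key, candidates []*net.UDPAddr) error {
	for _, ep := range candidates {
		err := client.ConfigureDevice(device, wgtypes.Config{
			Peers: []wgtypes.PeerConfig{{
				PublicKey:  peer,
				UpdateOnly: true,
				Endpoint:   ep,
			}},
		})
		if err != nil {
			return err
		}
		time.Sleep(10 * time.Second) // give the handshake a chance

		dev, err := client.Device(device)
		if err != nil {
			return err
		}
		for _, p := range dev.Peers {
			if p.PublicKey == peer && time.Since(p.LastHandshakeTime) < 15*time.Second {
				return nil // this endpoint works
			}
		}
	}
	return fmt.Errorf("no working endpoint found for peer %s", peer)
}
```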

Once traffic is received from a node, the network information necessary to return traffic is learned and updated automatically by KubeSpan and WireGuard on the receiving node – and then shared with the discovery service. This means that all nodes rapidly obtain every possible way of securely communicating with every other node – even if a node changes its IP address or moves behind a firewall.

For security reasons, the discovery service doesn’t see actual node information – it only stores and updates encrypted blobs. Discovery data is encrypted/decrypted by the clients – the cluster members. The discovery service does not have the encryption key.
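
A minimal sketch of that sealing step, assuming AES-GCM with a 256-bit key derived from a cluster secret that only members hold – the actual wire format and key derivation are Talos’s own:

```go
package kubespan

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"io"
)

// Seal encrypts a node's discovery payload with the cluster-wide
// secret before upload, so the service only ever stores opaque blobs.
func Seal(clusterSecret [32]byte, payload []byte) ([]byte, error) {
	block, err := aes.NewCipher(clusterSecret[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so any other cluster member can decrypt.
	return gcm.Seal(nonce, nonce, payload, nil), nil
}
```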

Sidero offers a free public discovery service as… a public service. Customers with a support contract may choose to run their own discovery service.

The Kubernetes-based system of node discovery utilizes annotations on Kubernetes Nodes which describe each node’s public key and local addresses, achieving a similar result to the discovery service – though it of course requires that a node has already established secure communication with the Kubernetes cluster.
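
With Kubernetes’ client-go, reading such annotations is a one-liner per node; note that the annotation keys below are invented for illustration and are not Talos’s actual names:

```go
package kubespan

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListAnnotatedPeers prints peer data published on Node annotations.
func ListAnnotatedPeers(ctx context.Context, cs *kubernetes.Clientset) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s key=%q endpoints=%q\n",
			n.Name,
			n.Annotations["kubespan.example.dev/public-key"], // hypothetical key
			n.Annotations["kubespan.example.dev/endpoints"],  // hypothetical key
		)
	}
	return nil
}
```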

Use cases – why you might use KubeSpan

Bursting to the public cloud

A VoIP provider can run their core infrastructure in their data center, using their fixed-cost assets most of the time. If they are wise, they will have used our Sidero Metal resource management system, which gives them a powerful Cluster API management plane and fully automated, network-booted servers. But in any case, their normal workload is run entirely within their own hardware.

When a high-traffic event occurs (say, one of their banking customers needs to notify a large number of its own customers in a short amount of time), they can add additional resources from AWS to their existing cluster. They have a simple Auto Scaling group defined in AWS which adds nodes to the cluster (very quickly, if they are using Talos, which installs and boots into Kubernetes faster than anything else we know of). When each node comes up, it discovers the WireGuard connection information for every other node in the cluster, and the existing nodes discover the new one. Everyone connects to everyone else, and even though some of the nodes are in AWS and some are in the private data center, they all talk as if they were in the same place.

The provider gets to manage their costs by keeping their normal workload on their own bare metal, while retaining the scalability of the cloud and the simplicity of running it all as a single Kubernetes cluster.

Controlling local Kubernetes nodes from the cloud

A public transport organization has their Kubernetes control plane running in a cloud provider, but all of their rider data is stored on physical machines at a data center for privacy control. Each of their transportation hubs has several display boards which regularly receive updates on the times and locations of their buses and trains. They even have sensors and detectors connected over WiFi and dispersed across a large number of locations.

And all of this is coordinated by a single Kubernetes cluster that spans the cloud, the data center, and the servers at the transport hubs. All of it communicates securely with the other resources in the cluster, even over public WiFi and common network links. KubeSpan and WireGuard secure everything.

All data protection regulations are followed, yet they are able to leverage the best resources for each component across a wide variety of locations – and still maintain the simplicity of managing a single Kubernetes cluster.

Curious to test KubeSpan yourself?

Set up a guided or self-drive proof of concept!
