Building a Kubernetes cluster with Talos on Hetzner bare metal servers

At the last TalosCon, Bunnyshell engineer Alex Oprisan shared the story of building bare metal Kubernetes clusters on Hetzner to run KubeVirt as a cost-effective way to provide VM-based development environments.
Bunnyshell is a SaaS platform that automates ephemeral environments for dev teams, spinning up a full environment per pull request, deploying to Kubernetes, and tearing it down when the PR closes. Traditionally these environments were container-based, but user feedback pointed to a clear preference for VMs: containers are immutable, difficult to extend with packages, and often distroless, which made them frustrating to debug.
The key constraint: development VMs are mostly idle on CPU but hold memory constantly, which made RAM per dollar the primary optimization target. That led the team to Hetzner, where the savings were substantial, especially on bare metal: a three-node cluster yielded 1.5TB of RAM and 192 vCPUs for under €600/month, roughly one-tenth of the AWS equivalent.
Here’s what they did.
Architecture
The chosen design was a hybrid cluster: a high-availability control plane on Hetzner VMs (low resource needs, easier to manage) with bare metal worker nodes. The full stack: Talos (OS), Cilium (CNI), Longhorn (CSI), KubeVirt, and kluctl for deployment. Hetzner's vSwitch connects metal and VM nodes via tagged VLAN.
Challenges & solutions
Networking: Metal and cloud servers aren't connected by default on Hetzner. The fix is creating a VPC-style network with dedicated subnets for each, connecting metal servers to a vSwitch, and exposing the cloud subnet to that vSwitch, enabling layer 2 communication between both node types.
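With the hcloud CLI, that setup might look like the following sketch. The network name, IP ranges, zone, and the vSwitch ID (created beforehand in the Hetzner Robot console) are placeholder values, not ones from the talk:

```shell
# Create the VPC-style network spanning both subnets.
hcloud network create --name k8s-net --ip-range 10.0.0.0/16

# Subnet for the cloud VMs hosting the control plane.
hcloud network add-subnet k8s-net \
  --type cloud --network-zone eu-central --ip-range 10.0.1.0/24

# Subnet exposed to the Robot vSwitch that the metal workers attach
# to, giving both node types layer 2 connectivity.
hcloud network add-subnet k8s-net \
  --type vswitch --vswitch-id 12345 \
  --network-zone eu-central --ip-range 10.0.2.0/24
```

The metal servers still need the tagged VLAN configured on their side so traffic lands in the vSwitch subnet.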
Installing Talos Linux on VMs: Hetzner VMs couldn't boot Talos directly. The solution uses hcloud-upload-image to install Talos, snapshot the VM, and use that snapshot as a base image.
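A hedged sketch of that flow, assuming a Talos image from the image factory; the schematic ID, version, and exact flags are placeholders, and may differ across hcloud-upload-image releases:

```shell
export HCLOUD_TOKEN=...

# hcloud-upload-image boots a throwaway server in rescue mode, writes
# the raw Talos image to its disk, and captures a snapshot that can
# then serve as the base image for new control plane VMs.
hcloud-upload-image upload \
  --image-url "https://factory.talos.dev/image/<schematic-id>/v1.9.0/hcloud-amd64.raw.xz" \
  --architecture x86 \
  --compression xz \
  --description "talos-v1.9.0"
```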
Installing Talos on metal: More involved. Hetzner provides only a public IP and SSH access to a rescue environment, so Talos must be installed over the public internet via SSH, curling the image and writing it to disk with dd. A further complication: disk names on metal servers are randomly assigned at boot and can change on reboot, bricking the server. Auction hardware also varies, with different drive counts, names, and sizes. The solution was custom scripts that install Talos while simultaneously discovering and recording disk information per server IP in a central git repository. Each server gets a custom Talos config rendered from that collected data. [Update: Talos Linux 1.12 introduced a unified User Volume framework that provides a consistent storage interface regardless of hardware variation, which may reduce or eliminate the need for custom disk-discovery scripts.]
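In outline, the rescue-mode install might look like this; the server IP, schematic ID, version, and target disk are placeholders, and the team's real scripts also render per-server Talos configs from the recorded disk data:

```shell
ssh root@<server-ip> <<'EOF'
# Record the disk layout of this server before touching anything, so
# it can be committed to the central git repo keyed by server IP.
lsblk -o NAME,SIZE,MODEL,SERIAL --json > /tmp/disks.json

# Stream the Talos metal image straight onto the target disk.
curl -L "https://factory.talos.dev/image/<schematic-id>/v1.9.0/metal-amd64.raw.zst" \
  | zstd -dc | dd of=/dev/nvme0n1 bs=4M status=progress
sync && reboot
EOF
```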
Node naming: Adding and removing physical nodes requires consistent, conflict-free private IP assignment. A manually maintained cluster node index file in git tracks server IPs; the install command references the index number to assign static addresses.
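The index-to-address mapping can be sketched as a tiny helper; the function name, subnet, and starting offset below are assumptions for illustration, not values from the talk:

```shell
# Hypothetical helper: derive a worker's static private IP from its
# position in the git-tracked cluster node index file. Assumes the
# metal subnet 10.0.2.0/24 with workers starting at .10.
node_ip() {
  local index="$1"
  echo "10.0.2.$((10 + index))"
}

node_ip 0   # → 10.0.2.10
node_ip 5   # → 10.0.2.15
```

Because the index file lives in git, two engineers adding nodes concurrently surface the conflict at merge time rather than as a duplicate IP on the network.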
Configuration storage: Each cluster gets a dedicated git repository holding all config, including Talos version, schematic ID, networking ranges, VLAN tag, and more. No secrets are stored, so the repository can be shared freely across the team.
Storage: Talos consumes the full disk on install, which wastes large drives. Longhorn solves this by using free space on the Talos disk as well as additional disks, without requiring separate partitioning.
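One hedged way to wire up an additional disk for Longhorn: mount it via the Talos machine config, then point Longhorn at it with its default-disks-config annotation. The device path, node name, and mountpoint below are examples, not values from the talk:

```shell
# Mount a secondary disk through the Talos machine config (machine.disks
# covers non-install disks; device names come from the discovery data).
cat > extra-disk-patch.yaml <<'EOF'
machine:
  disks:
    - device: /dev/sda
      partitions:
        - mountpoint: /var/mnt/disk1
EOF
talosctl patch machineconfig --nodes 10.0.2.10 --patch @extra-disk-patch.yaml

# Tell Longhorn to schedule replicas on both its default path (free
# space on the Talos disk) and the newly mounted disk.
kubectl annotate node worker-1 \
  node.longhorn.io/default-disks-config='[{"path":"/var/lib/longhorn","allowScheduling":true},{"path":"/var/mnt/disk1","allowScheduling":true}]'
kubectl label node worker-1 node.longhorn.io/create-default-disk=config
```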
Bricking risk: A typo in a 200-line configuration file could brick a remote server, forcing a costly manual reset. The team mitigated this through the custom scripting and git-based config workflow described above. [Update: Talos Linux 1.12 introduced staged networking via multi-document configuration. Users can now establish a simple heartbeat connection first, then layer in VLANs and bonded interfaces only after connectivity is confirmed, directly addressing this risk.]
GitOps feedback loop: Early use of Argo/Flux meant a slow commit→push→deploy cycle — any typo cost roughly 10 minutes per iteration. kluctl replaced this. It supports local file deployment for rapid iteration and GitOps for production, letting the team prototype locally before committing to the full pipeline.
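In practice the iteration loop might look like this; the target name `dev` is an example, defined in the project's `.kluctl.yaml`:

```shell
# kluctl deploys whatever is in the local working tree, so iteration
# is edit → deploy, with no commit/push round trip.
kluctl diff -t dev      # preview changes against the live cluster
kluctl deploy -t dev    # apply the local files directly

# Once the change works, commit it and let the GitOps pipeline
# reconcile the same target from git for production.
git commit -am "tune cluster config" && git push
```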
Bringing existing clusters under Omni management: At the time of writing, onboarding an existing Talos cluster into Omni required rebuilding nodes from scratch, risking production downtime. [Update: Omni now provides an experimental cluster import feature. A single CLI command brings an existing Talos Linux cluster under Omni management without downtime or infrastructure rebuild.]
The result?
The team found that once they had cracked the code, the cost savings were substantial, and the combination of kluctl, Longhorn, and git-based configuration management made their heterogeneous bare metal clusters workable at scale.


