January 8, 2021

CoreOS replacement and update with Talos OS: a tale of love and heartbreak.

Justin Garrison

CoreOS to Talos: a journey

I loved CoreOS. Since before Kubernetes, since the early days when etcd was a brittle configuration storage database, since fleet was a premier cluster scheduling system… I had used, worked on, worked with, and loved CoreOS. After a year, all of my new x86 servers, regardless of their locations or roles, were running the OS, and it was glorious.

After many lovely years of automatic updates, the addition of many excellent tools, and a few fantastic conferences (the first CoreOS Fest had food coordinated by Kelsey Hightower; it still stands out as the conference with the best food, by far), the news came out that Red Hat had purchased CoreOS. It was crushing. I tried to hope for the best. I was happy for the CoreOS people individually, but I was very much afraid that the CoreOS I knew and loved had its days numbered.

I gave it a shot, though. They promised a transition path. I knew these people. I trusted them. If they told me everything would be alright, I believed them. Then came the changes. I knew they were coming; I tried to brace myself, but there is only so much you can do. Yeah, RPMs. Okay, I could have guessed. When I handed over my Ceph containerization to Red Hat, that was one of the first things to come on the scene: RPMs. Fine, whatever. Then came NetworkManager. What on Earth? This is a server, not a laptop!

The changes kept coming, and I started to realize that the engineers I knew and trusted weren’t the ones in charge. There was a flood of new Red Hat engineers. They didn’t understand the product. They didn’t have the same values. Their goal was just to make an engine for OpenShift. Nothing else mattered. (OpenShift is not bad. But it is not CoreOS, and it is not Talos Linux. See a comparison of Talos as an OpenShift alternative.)

Fedora CoreOS had become CoreOS in name only. It was a complete reversal of the values and design principles I believed in. I knew the pull, of course. I’ve been a Linux systems engineer since the mid-90s. I know the call of having all your old tools back, knowing you can bodge anything around to get it back into shape. I really do understand.

I just don’t agree. The new ways really are better.

So I left. I was broken-hearted and dreading going back to the bad old days of running a full, manual OS on all my servers. I left, and I started hunting. When it came down to it, I refused to go back. I would not submit to regressive engineering. Even if I had to build a new Linux distribution myself, I would not simply tuck my tail and crawl back into my past.

It was then that I discovered Talos OS. At the time, it was very early, barely more than a concept, but it was enough to get my thoughts running. Literally writing the OS itself from scratch, based on just the Linux kernel? That’s exactly what we need. An OS which is completely controlled by an API? Oh, boy, sign me up! No shell, no SSH, a truly immutable image? These people really know what they’re talking about. And… Kubernetes first!

See, CoreOS had come along before Kubernetes. As a result, it had to be targeted more generally. They quickly embraced Kubernetes when it came along, for good reason: it is the continuation of the same design philosophy up the stack. However, later in its life CoreOS was plagued by a misalignment: it made a good Kubernetes base OS, but not a great one.

Talos, on the other hand, took Kubernetes as a first principle. Talos was designed from the start as a minimal base OS for Kubernetes. Kubernetes is baked into the OS, sure, but also into the design of the OS. In fact, at many turns, Talos has chosen to model interfaces after Kubernetes. The goal is to make a seamless continuum of declarative configuration from the top of the stack to the bottom.

When I first started using Talos over a year and a half ago, it was a little rough, but even then, the returns were immediately obvious. While I thought a fully-API-driven OS was a good idea, after I started using it, Talos proved that it was a great idea.

For example:

Say you want to roll out a new configuration to change the DNS servers on a set of machines. A single API call will allow you to push a new configuration to a machine. You can then monitor its update, verify the machine can resolve DNS queries, and generally evaluate the effects before moving on to the next machine.

All of this can be done with something as simple as a shell script or as expressive as a real programming language (any language which speaks gRPC, in fact). Moreover, this can be executed securely from any machine, be it a workstation, a server, or even as a workload from inside the cluster. You could do this with scripting and SSH, sure, but it would be a lot more work, with lots of translation, character and expansion escaping, and very little reliable error handling. With an API-driven OS, the process to update one machine or 10,000 machines is the same.
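To make the one-machine-at-a-time rollout concrete, here is a minimal sketch in Python. It is not Talos code: `rolling_update`, `apply`, and `verify` are hypothetical names I made up for illustration, and the two callables stand in for whatever actually pushes the config and checks the result (for instance, a DNS lookup on the node).

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_update(
    nodes: Iterable[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Update nodes one at a time and stop at the first failed check.

    Returns (nodes updated and verified, first node that failed or None).
    """
    updated: List[str] = []
    for node in nodes:
        apply(node)            # hypothetical: push the new config to this node
        if not verify(node):   # hypothetical: e.g. confirm DNS resolves on it
            return updated, node  # halt the rollout; later nodes are untouched
        updated.append(node)
    return updated, None


# Usage with stand-in functions; the second node fails its check.
pushed: List[str] = []
ok, failed = rolling_update(
    ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
    apply=pushed.append,
    verify=lambda node: node != "10.0.0.12",
)
print(ok, failed, pushed)
```

The loop is the same whether the list holds one address or thousands, which is the point. In a real setup `apply` might shell out to `talosctl apply-config` or call the gRPC API directly; that wiring is deliberately left out here.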

CoreOS had a pretty declarative OS config with Ignition, but it was still procedural in many ways. Talos has a purely declarative config, since there is no scripting available anyway. Composition of those Ignition configs in CoreOS was always something of a chore. Talos is really embracing composable configs.
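As a rough illustration of what composing declarative configs can look like, the sketch below layers a small patch over a base document. This is a simplified recursive merge written for this post, not how Talos implements its config patching, and the `compose` helper and the sample keys are assumptions chosen to resemble a machine config.

```python
from copy import deepcopy
from typing import Any, Dict


def compose(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config: nested dicts are merged, other values are replaced."""
    merged = deepcopy(base)  # never mutate the caller's base config
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = compose(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative documents only; the key names are assumptions.
base = {"machine": {"type": "worker", "network": {"nameservers": ["10.0.0.1"]}}}
dns_patch = {"machine": {"network": {"nameservers": ["1.1.1.1", "9.9.9.9"]}}}

config = compose(base, dns_patch)
print(config)
```

Because there is no scripting in the config, a fragment like `dns_patch` can be reasoned about on its own: it says what the DNS servers should be and nothing about how to get there.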

Ultimately, the old CoreOS and the new Talos come at the problem from different perspectives, and they don’t share any code beyond the Linux kernel. However, Talos can very much be thought of as a spiritual successor to CoreOS, refining and continuing the progress into an age where we can ignore the underlying host, and the infrastructure, altogether. Together with Kubernetes, it lets us focus on workloads and not on infrastructure. And that is the goal. So now I have a new love: Talos OS.
