Kubernetes is known for being powerful and flexible. It’s also known for being difficult to understand and maintain, requiring expert knowledge. Once an infrastructure is established, its users often avoid making changes or updates. This can be relatively easy to manage in small-scale deployments as the infrastructure is not yet overly complex.
However, Kubernetes infrastructure management looks very different at large scale. When there are hundreds of nodes or pods, teams often avoid critical updates for fear of breaking the system, or they bolt on new tools to patch emerging issues. Both habits introduce vulnerabilities and fragility, leaving many infrastructures resembling a house of cards. Given Kubernetes’ complexity, businesses that adopt a “set it and forget it” mindset will face a growing number of problems, including:
- Delayed upgrades due to brittle automation
- Unexpected problems caused by decisions made when the infrastructure was small
- Disjointed infrastructure resulting from unchecked and undocumented fixes
- Infrastructure costs that balloon out of control
Let’s take a closer look at the challenges organizations will face as their infrastructure scales and how to tackle them.
Exorbitant Cloud Costs
Cloud computing promised easy, affordable access to infrastructure. In many ways, it delivered. Cloud computing enables businesses to minimize up-front costs, outsource the management of hardware, and get infrastructure up and running quickly. This speed and adaptability can be critical for businesses looking to get ahead and initially caused “cloud-first” to become a popular shorthand for “modern.”
Today, what was once modern is now considered legacy, and “large-scale cloud” is synonymous with “expensive.” For industries with particularly high compute needs, cloud costs are the second-highest expense after personnel. The once-prized short time to market pales in comparison to the long-term costs, and teams that never look beyond the cloud will find themselves locked in and overpaying for a solution that no longer makes sense.
Configuration Drift
No matter how many times you say you treat nodes like cattle, an increase in nodes and clusters will lead to configuration drift. Teams rely on standardization policies to ensure consistency across deployments, but when a developer needs to make a late-night fix, they will still SSH in if that’s what it takes to get the job done. Whether it’s a troublesome node or an outdated config file, every deployment will have outliers.
Highly tailored infrastructures function perfectly on paper but rarely in practice. Troubleshooting a system full of snowflakes is like finding a needle in a haystack, and managing one leads to downtime and, ultimately, workarounds for the workarounds. It also leads to one of the most dangerous shortcomings of large Kubernetes deployments: vulnerability.
Fewer Upgrades, More Vulnerabilities
At the individual level, managing a Kubernetes infrastructure is not necessarily difficult. Upgrades and oversight are an ongoing part of operations, but without the right foundation, they become both time-consuming and risky as clusters and deployments scale.
With a large or distributed infrastructure, “upgrade” might as well be a four-letter word: teams are unable to predict whether machines will come back in the right state, or whether they’ll come back at all. In the best-case scenario, a team spends days successfully performing a simple upgrade. Many teams skip upgrades altogether rather than risk breaking the infrastructure, which introduces vulnerabilities and degrades performance. This is why so many Kubernetes clusters run several versions behind.
Out-of-date clusters create security risks and leave systems exposed. Protocols and diligence keep the system secure 99% of the time, but even the most diligent team can fall victim to ransomware and other threats. They risk losing their data, not knowing where it may have been distributed, and paying a high price simply to return to the status quo.
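The “several versions behind” problem is easy to make concrete. As a rough sketch (the node names and versions below are illustrative sample data, not from a real cluster — in practice you would read them from `kubectl get nodes -o json`), a drift check might flag any kubelet lagging the newest one by more than a minor version:

```python
# Sketch: flag kubelet version skew across nodes. Sample data mirrors the
# shape of `kubectl get nodes -o json`; names and versions are hypothetical.
sample_nodes = [
    {"name": "cp-1", "kubeletVersion": "v1.29.4"},
    {"name": "worker-1", "kubeletVersion": "v1.29.4"},
    {"name": "worker-2", "kubeletVersion": "v1.27.9"},  # lagging node
]

def parse(version):
    # "v1.27.9" -> (1, 27, 9) for tuple comparison
    return tuple(int(part) for part in version.lstrip("v").split("."))

newest = max(parse(n["kubeletVersion"]) for n in sample_nodes)
laggards = [
    n["name"]
    for n in sample_nodes
    if newest[1] - parse(n["kubeletVersion"])[1] > 1
]
print(laggards)
```

Kubernetes tolerates some kubelet skew behind the API server (the supported window is defined by the project’s version skew policy), but the wider the gap grows, the riskier each eventual upgrade becomes.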
Discrepancies Between Locations
After all that, another major hurdle remains: multi-location Kubernetes.
Deployments spread across regions and environments support high-availability clusters, reduce customer latency, and open new opportunities for cost-cutting, but most providers make this difficult, leaving infrastructures confined to a single environment or region. Breaking out of these boundaries generally requires manual deployment, expensive direct connections, and complex network routing. In short, providers make it prohibitively difficult to create consistency across the entire deployment, locking users into expensive and rigid setups.
What to Do About the Kubernetes Problem
None of these pains are new. Cloud costs, security vulnerabilities, and configuration drift have been troubling tech teams since long before Kubernetes. Year after year, solutions are introduced to fix large-scale Kubernetes, but these piecemeal solutions lead to even more complex and fragile systems. Eventually, businesses will find their stack is filled with specialized products to solve problems they shouldn’t be facing.
We built Talos Linux and Omni to replace this endless chain of one-offs. Rather than fight fires one-by-one, we built an operating system and enterprise-grade management interface to fix the stack from the bottom up. Here’s what you’ll want to know.
Talos Linux and Omni keep your infrastructure flexible. Large-scale infrastructures are built to last for years, which erodes the short-term value of the time to market that made cloud-first attractive. Teams benefit greatly from the flexibility to mix and match environments and optimize for cost over time, particularly by prioritizing on-prem compute and bursting to the cloud when required. This allows businesses to pay for what they need, when they need it, and nothing more. We’ve seen companies cut infrastructure costs by 90% by moving from cloud to hybrid. Virtually every organization can benefit from taking a closer look at how their infrastructure is set up.
Talos Linux and Omni prioritize your control. It doesn’t matter whether you’re using cloud, on-prem, edge, or any mixture thereof. Talos Linux works equally well across all of them, and Omni provides the user-friendly interface to manage all of it. Omni enables enterprise Kubernetes management for teams of all sizes.
Talos Linux and Omni are built on a philosophy of simplicity. When you have a lot of complexity, workarounds, and manual changes, something will go wrong; it’s only a matter of time. That’s why we cut out the complexity and created one of the smallest attack surfaces imaginable. Ubuntu and Flatcar each ship 2,000+ binaries; Talos Linux has fewer than 50. With Talos Linux, you’re exposed to far fewer CVEs by default. With Talos Linux and Omni, you’re more secure from day one.
Talos Linux is immutable. Human error is one of the largest threats to an otherwise solid infrastructure. That’s why Talos Linux has no SSH and no shell. By encouraging teams to build reliable infrastructure from the start, it prevents ad hoc changes and puts a stop to configuration drift. With a secure OS, teams can redirect their attention to application security, where the more sensitive data lives.
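The no-SSH model works because every machine is driven by a declarative configuration applied over the Talos API rather than by interactive logins. As a minimal sketch (the disk path, image tag, and endpoint below are illustrative placeholders, and a real config also needs the secrets that `talosctl gen config` generates), a Talos worker config looks roughly like:

```yaml
# Sketch of a Talos Linux worker machine config (v1alpha1 schema).
# Values are placeholders; applied over the API with, e.g.:
#   talosctl apply-config --insecure --nodes 10.0.0.5 --file worker.yaml
version: v1alpha1
machine:
  type: worker
  install:
    disk: /dev/sda                               # target install disk (placeholder)
    image: ghcr.io/siderolabs/installer:v1.7.0   # illustrative version tag
cluster:
  controlPlane:
    endpoint: https://cluster.example.com:6443   # placeholder endpoint
```

Because the config file is the only way to change a machine, the late-night SSH fix that causes drift simply isn’t possible; any change must flow through the same reviewable, repeatable path.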