How Nokia runs one of the world's largest private clouds on Talos Linux

Nokia is, by any measure, a large corporation: it employs around 90,000 people across more than 100 countries, and generates more than $25 billion in revenue across telecommunications, information technology, and other business units.

Janne Heino, Head of Nokia Global Services Cloud Architecture, started looking at cloud architectures for Nokia in 2011. At that time, today's large cloud providers did not yet have mature cloud offerings: he “got a lot of beautiful powerpoints”, but the infrastructure on offer didn’t actually work very well.

Instead, Janne led the project to develop NESC, the Nokia Enterprise and Services Cloud, internally. Once the internal cloud platform was established and shown to be reliable, NESC became, in 2014, the place where most teams did all of their internal Nokia development. This led to rapid scaling, from 11,000 cores in the cloud to over 400,000 in just two years. The platform was based on OpenStack and Eucalyptus. NESC now consists of 11,000 servers in 350 racks worldwide, across 7 datacenters.

One of the things the NESC team learned in operating such a large platform is that as the number of users scales, everything breaks. For example, repositories tested in lab settings, with perhaps 500 installations a year, did not work well when there were 10,000 installations every morning in a short time frame. And resolving one scaling issue moved the bottleneck to another point in the stack: everything needed upgrading, including servers, storage, DNS, and networks.

The NESC team realized that they always needed to be planning what to do next to scale, because when things scale, they break. Automation is the only way to maintain quality and keep costs under control.

By 2019 it was clear that Kubernetes was going to be the next big environment their internal teams wanted to consume at scale, so NESC started exploring offering Kubernetes as a service that year.

How did they end up on Talos Linux?

When the NESC team was looking for the operating system for their Kubernetes offering, they applied what they had learned from years of operating a large-scale cloud service. They knew automation was going to be critical to operating reliably at scale, and Talos Linux, with its native API-based management, seemed a good fit. Security was also critical: they wanted the footprint of the operating system to be as small as possible, leaving less to attack. They had been running Red Hat and CentOS, but these systems install with a huge number of services and packages, almost all of which are not needed if you are only running Kubernetes. By adopting Talos Linux, they found that the number of binaries on the systems (and thus possible vectors of attack, and things that needed management and updating) dropped from the thousands to the tens, an improvement of multiple orders of magnitude.

“Talos Linux is faster, it’s lighter, it does what it is supposed to, and not something else.”

They also found Talos Linux is very efficient – it reduces overhead everywhere, leaving more resources for the workloads and generally improving latency.

It took 3 months to get an initial Kubernetes system out, and a total of 15 months to reach GA. NESC provides managed Kubernetes, similar to what most public clouds offer, although it goes beyond those offerings. They monitor and automatically repair the nodes in the cluster. NESC also deploys standard logging and monitoring into the clusters, and handles upgrades of Kubernetes and of the logging and monitoring apps. They offer Kubernetes on bare metal (using Sidero Metal, the Kubernetes CAPI provisioner for bare metal from Sidero Labs) and also on virtual machines running Talos Linux on OpenStack. The internal customer is responsible for user management, service accounts, etc. NESC doesn’t restrict what the end users do in the cluster: NESC uses ArgoCD to deploy the clusters, but end users can also deploy ArgoCD themselves.
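The talk summary does not describe how the node auto-repair decides when to act. As a rough illustration only, a health-check loop of this kind is often built around a "consecutive failures" counter. Everything in the sketch below (the `NodeStatus` type, the threshold of 3, the function names) is hypothetical and is not taken from NESC's implementation.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many failed checks in a row trigger a repair.
FAILURE_THRESHOLD = 3


@dataclass
class NodeStatus:
    """Health bookkeeping for one node (illustrative, not a real NESC type)."""
    name: str
    consecutive_failures: int = 0


def record_check(status: NodeStatus, healthy: bool) -> None:
    """Update the failure counter after one health check."""
    if healthy:
        status.consecutive_failures = 0
    else:
        status.consecutive_failures += 1


def nodes_to_repair(statuses: list[NodeStatus],
                    threshold: int = FAILURE_THRESHOLD) -> list[str]:
    """Return the names of nodes that have failed `threshold` checks in a row."""
    return [s.name for s in statuses if s.consecutive_failures >= threshold]


# Example: node-b fails three checks in a row, node-a recovers after one failure.
node_a, node_b = NodeStatus("node-a"), NodeStatus("node-b")
for a_ok, b_ok in [(False, False), (True, False), (True, False)]:
    record_check(node_a, a_ok)
    record_check(node_b, b_ok)

repair_list = nodes_to_repair([node_a, node_b])
print(repair_list)
```

Requiring several consecutive failures, rather than acting on the first one, is a common way to avoid replacing nodes over a transient network blip.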

Adoption got off to a slow start, then exploded: they went from 10,000 to 130,000 cores in Kubernetes in a year. They now have 320 Kubernetes clusters spanning 130,000 cores, with 5,400 virtual nodes and 380 bare metal nodes, typically running 55,000 pods.
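To put those totals in perspective, the per-cluster and per-node averages can be worked out directly from the figures quoted above. This is simple arithmetic on rounded numbers from the talk, so the results are only indicative; real clusters vary widely in size.

```python
# Figures quoted in the article (rounded; "130K" is read as 130,000).
clusters = 320
cores = 130_000
virtual_nodes = 5_400
bare_metal_nodes = 380
pods = 55_000

nodes = virtual_nodes + bare_metal_nodes   # total Kubernetes nodes
cores_per_cluster = cores / clusters       # average cluster size in cores
nodes_per_cluster = nodes / clusters       # average cluster size in nodes
pods_per_node = pods / nodes               # average pod density

print(f"total nodes:       {nodes}")
print(f"cores per cluster: {cores_per_cluster:.0f}")
print(f"nodes per cluster: {nodes_per_cluster:.1f}")
print(f"pods per node:     {pods_per_node:.1f}")
```

On these numbers the average cluster has roughly 400 cores and 18 nodes, with about 9.5 pods per node.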

NESC does not deliver multi-tenant clusters, as some of the clusters the end users run are fairly large – the largest cluster is about 11,000 cores, running on bare metal.

They wrote their own API, the NKS API, to support OpenStack, Sidero Metal and VMware. This helps deliver consistent cluster management, updates, and deletion. It also manages the auto-repair control, account quotas, and cloud config management, and provides some debug facilities. Talos Linux is the default Kubernetes provisioner for all environments.

NESC uses the same deployment model at all their large datacenters, and also at edge rack deployments: there is one Kubernetes cluster for the NKS master, which runs the API and local datastores, another Kubernetes cluster for monitoring, and then one or more customer clusters.
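The per-site layout described above can be written down as a small data model with a consistency check. This is a sketch for illustration only: the `Cluster` and `Site` types, the role strings, and the example names are hypothetical, not part of the NKS API.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical role labels for the three kinds of cluster at each site.
ROLES = {"nks-master", "monitoring", "customer"}


@dataclass
class Cluster:
    name: str
    role: str  # one of ROLES


@dataclass
class Site:
    name: str
    clusters: list[Cluster] = field(default_factory=list)


def layout_problems(site: Site) -> list[str]:
    """List the ways a site deviates from the layout described in the article."""
    counts = Counter(c.role for c in site.clusters)
    problems = []
    if counts["nks-master"] != 1:
        problems.append("expected exactly one NKS master cluster")
    if counts["monitoring"] != 1:
        problems.append("expected exactly one monitoring cluster")
    if counts["customer"] < 1:
        problems.append("expected at least one customer cluster")
    for role in sorted(set(counts) - ROLES):
        problems.append(f"unknown cluster role: {role}")
    return problems


# Example: an edge rack with the full layout, and a site missing monitoring.
edge = Site("edge-rack-1", [
    Cluster("nks", "nks-master"),
    Cluster("mon", "monitoring"),
    Cluster("team-a", "customer"),
])
incomplete = Site("dc-1", [Cluster("nks", "nks-master"),
                           Cluster("team-b", "customer")])

print(layout_problems(edge))
print(layout_problems(incomplete))
```

Using one layout everywhere means a check like this applies equally to a large datacenter and to an edge rack.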

Their costs are about one third of those of public cloud providers, accounting for all costs, including datacenter space, power, people onsite, etc.

 

Services are what comes next. They already offer GPUs and database-as-a-service, running on Talos Linux in their Kubernetes cloud, but there will be more to come.

This article is a summary of one of the talks the NESC team gave at TalosCon 2023. All the TalosCon talks are available in the TalosCon 2023 Kubernetes talks playlist.