
How to scale Kubernetes to support billions of transactions a day with low latency (and low cost!) - Tremor Video/Nexxen

Tremor Video (now a part of Nexxen) connects publishers and advertisers for the display of pre-roll video ads (among other things). They process many billions of requests a day, each involving complex algorithmic bidding and matching, and every request has to complete in less than 100 ms. They mostly operate on-premises, but also run in public clouds, and have made several acquisitions that need to be integrated.
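To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It computes the average request rate for each billion requests per day and, using Little's law with the 100 ms deadline as an upper bound on latency, the number of requests that would be in flight at once. The one-billion unit is only a convenient round figure, not Tremor Video's actual volume, and real traffic is rarely spread evenly across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rps(requests_per_day: float) -> float:
    """Average requests per second if traffic were spread evenly over a day."""
    return requests_per_day / SECONDS_PER_DAY


def max_in_flight(requests_per_day: float, latency_budget_s: float) -> float:
    """Little's law (L = arrival rate * time in system).

    Using the latency budget as the time in system gives an upper bound
    on the average number of concurrent requests.
    """
    return average_rps(requests_per_day) * latency_budget_s


if __name__ == "__main__":
    per_billion = 1_000_000_000  # illustrative unit, not a measured figure
    print(f"{average_rps(per_billion):,.0f} requests/second per billion daily requests")
    print(f"{max_in_flight(per_billion, 0.100):,.0f} requests in flight at a 100 ms budget")
```

Multiply both results by the number of billions of daily requests to scale the estimate.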

They operate a lot of microservices, with different resource requirements, and they all need to scale appropriately with low latency. This is an expensive proposition, so they run most of their infrastructure on bare metal, but at times need to burst capacity. 

Initially, around 2016, they would deploy monolithic apps on dedicated servers, which resulted in lots of wasted capacity on each server. They knew containers would address this issue. They started with Docker Swarm and Consul, which allowed containers to be reached directly. They ran into limitations, and decided to move to Kubernetes for its support of sidecars and other advantages.

Architecturally, apps are deployed and scaled with Kubernetes, and advertise a service to Consul. Services generally run on bare metal nodes, and every pod has its own routable IP address, which avoids ingresses and load balancers entirely and minimises latency. To make pod IPs reachable, they use kube-router to peer with the top-of-rack switch via BGP and announce the pod IPs.
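As an illustration of the "advertise a service to Consul" step, the sketch below builds the JSON body that Consul's agent API accepts at `PUT /v1/agent/service/register`, using the pod's own IP as the service address so that clients can connect to the pod directly. The service name, pod name, port, IP address, and the `build_registration` helper are all hypothetical; the source does not describe how Tremor Video actually performs the registration.

```python
import json


def build_registration(service_name: str, pod_name: str, pod_ip: str, port: int) -> dict:
    """Build a Consul service registration body for a single pod.

    ID must be unique per instance, so it combines the service and pod names.
    Address is the pod's routable IP, so no load balancer sits in the path.
    """
    return {
        "ID": f"{service_name}-{pod_name}",
        "Name": service_name,
        "Address": pod_ip,
        "Port": port,
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    body = build_registration("bidder", "bidder-7c9f", "10.244.3.17", 8080)
    print(json.dumps(body, indent=2))
```

Inside a pod, the IP could be supplied through an environment variable populated from the Kubernetes downward API field `status.podIP`; that wiring is likewise an assumption, not something the talk summary specifies.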

They hand-built their initial Kubernetes clusters with kubeadm, but didn’t pay attention and allowed their certificates to expire (the “Kubesplosion”, which was not pretty). They switched to Kubespray. That was OK, but the bigger a cluster got, the slower the deployments got: a 100-node cluster would take over 2 hours to deploy.
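The certificate expiry mentioned above is the kind of failure a periodic check can catch early. The sketch below is a generic illustration, not the tooling Tremor Video used: it takes a certificate's `notAfter` timestamp in the text format that OpenSSL prints and reports how many days remain. The 30-day warning threshold and the sample dates are arbitrary assumptions.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumed threshold; pick whatever lead time suits your process


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter string such as 'Jun  1 00:00:00 2024 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within the warning window (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed date so the example is repeatable
    for not_after in ("Jan 21 00:00:00 2024 GMT", "Jun  1 00:00:00 2024 GMT"):
        status = "RENEW" if needs_renewal(not_after, now) else "ok"
        print(f"{not_after}: {days_until_expiry(not_after, now):.0f} days left [{status}]")
```

On clusters built with kubeadm, `kubeadm certs check-expiration` reports the same information for the cluster's own certificates.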

They found Talos Linux via a Reddit post a few years ago.

They liked the principles of Talos Linux: being efficient, secure, and having appliance-like management. They wanted to treat their nodes like cattle, not pets, so they set out to test their architecture with Talos.

Some key benefits they discovered:

  1. Immutable infrastructure – Talos runs as an immutable, minimal Linux distribution with the rootfs mounted read-only, preventing users from logging in and modifying nodes.
  2. Kubernetes integration – Talos is purpose-built for Kubernetes, making it easy to bootstrap highly-available Kubernetes clusters on bare-metal.
  3. Automatic machine lifecycle management – Nodes can automatically provision, maintain, and terminate themselves without manual intervention.
  4. Resource efficiency – By running control plane nodes as VMs on VMware, they could dedicate more bare-metal resources for their applications.

By adopting Talos and automating deployments with Sidero Metal, the CAPI (Cluster API) provider for Talos on bare metal, they could declaratively define and provision Kubernetes clusters on bare metal with a single command, and have a cluster deployed in around 20 minutes. Scaling worker nodes was as simple as running another command.

They had a small team, so they needed to be able to operate efficiently, and Talos allowed them to scale operations without requiring a lot of time investment or extra staff. It was a dream come true – it prevented developers from messing with the boxes, and enabled them to manage their applications and not worry about the hardware and the overhead that comes with that.

Overall, Talos provided an appliance-like management experience and enabled Tremor Video to efficiently run their low-latency applications on bare-metal Kubernetes with immutable infrastructure and automatic lifecycle management.

Data centers are expensive in terms of personnel, and you need reliable processes and systems, but they’ve saved millions of dollars compared to running everything in the cloud, and their performance is notably better on bare metal. That is not even counting the data transfer charges they would incur, which would be very expensive in the cloud.

 

For more detailed information, watch the talk from TalosCon 2023.