This is the third post in our blog series on Rubix (#1, #2), our effort to rebuild our cloud architecture around Kubernetes.
The advent of containers and their orchestration platforms, most popularly Kubernetes (K8s), has familiarized engineers with the concept of ephemeral compute: once a workload has completed, the resources used to run it can be destroyed or recycled. Ephemeral compute constructs, such as K8s pods, make it possible to:
The concept of ephemeral compute also positively impacts engineering discipline. Because pods might get terminated at any time, engineers must scrutinize an application’s resiliency at all stages of development.
The Kubernetes pod is often the only level at which engineers interact with the concept of ephemeral compute. When first designing Rubix, Palantir’s Kubernetes-based infrastructure platform, we asked ourselves: what value can we get from applying the concept of ephemeral compute further down the stack?
This post covers Rubix’s commitment to managing K8s clusters composed of immutable, short-lived K8s nodes, or “compute instances.” We discuss our conviction in this approach, its attendant challenges, and how it has made production Rubix environments more resilient and easier to operate.
Managing a Kubernetes cluster composed of short-lived nodes sounds hard. So, why bother? We can gain conviction in the value of this endeavor by recounting the benefits of a more familiar ephemeral compute construct: the Kubernetes pod. These benefits boil down to a pod’s immutability:
We can realize similar benefits by treating K8s nodes as ephemeral compute instances. Immutable nodes derived from instance templates simplify the debugging of any node issues encountered in production. The ability to destroy and replace nodes removes the need for upgrading instances in place.
In addition, we added the constraint that nodes in Rubix environments cannot live longer than 48 hours. There are several benefits of this constraint:
Having convinced ourselves that running K8s clusters composed of ephemeral nodes has such upside, let’s discuss the problems we had to solve to achieve this goal.
Terminating instances every two days makes it challenging for applications running in Rubix to maintain availability. Developers must write their applications with high availability in mind, but our infrastructure must also ensure that instance terminations don't destabilize hosted services.
To address this, we developed a termination pipeline with three components:
This decomposition provides us both operational and stability benefits.
Operationally, our node selection policy abstraction allows us to encode learnings from cluster operations. Upon identifying categories of bad node states in production, we define policies to automate the termination and replacement of such nodes. For example, the old AWS EBS volume plugin (since deprecated in favor of the EBS CSI driver) applied NoSchedule taints to a node if it detected the node had volumes stuck in an attaching state. The Rubix team added a termination policy selecting nodes with such taints, removing the need for manual intervention in future instances of the same issue.
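To make this policy abstraction concrete, here is a minimal sketch of what node selection policies might look like. The interface, the policy names, and the taint key are illustrative assumptions rather than Rubix's actual implementation; the max-age policy corresponds to the 48-hour lifetime described earlier.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeSelectionPolicy marks nodes that should be terminated and replaced.
// The interface and concrete policies below are hypothetical illustrations.
type NodeSelectionPolicy interface {
	Name() string
	Selects(node *corev1.Node) bool
}

// taintPolicy selects nodes carrying a taint that indicates a bad state,
// such as the NoSchedule taint the legacy EBS volume plugin applied when
// volumes were stuck attaching.
type taintPolicy struct {
	taintKey string
}

func (p taintPolicy) Name() string { return "taint/" + p.taintKey }

func (p taintPolicy) Selects(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == p.taintKey && t.Effect == corev1.TaintEffectNoSchedule {
			return true
		}
	}
	return false
}

// maxAgePolicy enforces a maximum node lifetime, e.g. the 48-hour limit
// described above.
type maxAgePolicy struct {
	maxAge time.Duration
}

func (p maxAgePolicy) Name() string { return "max-age" }

func (p maxAgePolicy) Selects(node *corev1.Node) bool {
	return time.Since(node.CreationTimestamp.Time) > p.maxAge
}

func main() {
	policies := []NodeSelectionPolicy{
		// Placeholder taint key for illustration only.
		taintPolicy{taintKey: "volumes-stuck-attaching"},
		maxAgePolicy{maxAge: 48 * time.Hour},
	}

	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:              "node-a",
		CreationTimestamp: metav1.NewTime(time.Now().Add(-72 * time.Hour)),
	}}

	for _, p := range policies {
		if p.Selects(node) {
			fmt.Printf("policy %s selects %s for replacement\n", p.Name(), node.Name)
		}
	}
}
```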
From a stability perspective, our termination logic accounts for application availability through its use of eviction APIs, respecting any relevant PodDisruptionBudgets (PDBs) during node draining.
Readers are likely already familiar with PDBs and evictions; the novelty of our termination pipeline solution is the encoding of policies to relieve environment operators of the burden of manually identifying and replacing nonfunctional cloud instances.
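As a rough illustration of PDB-respecting draining, the sketch below cordons a node and evicts its pods through the Kubernetes eviction API using client-go. Error handling and retry behavior are deliberately simplified; this is not Rubix's actual drain code.

```go
package drain

import (
	"context"
	"fmt"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode cordons a node and evicts its pods through the eviction API,
// so that PodDisruptionBudgets are enforced by the API server.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so replacement pods land elsewhere.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Evict every pod scheduled on the node.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		for {
			err := client.CoreV1().Pods(pod.Namespace).EvictV1(ctx, eviction)
			if err == nil {
				break
			}
			// A 429 means a PDB currently blocks this eviction; back off and retry.
			if apierrors.IsTooManyRequests(err) {
				time.Sleep(5 * time.Second)
				continue
			}
			return fmt.Errorf("evicting %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}
```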
Before the arrival of container orchestration platforms, application upgrades were painful and error-prone. Pitfalls included:
Engineers who have experienced these upgrades can appreciate the simplicity of a K8s Deployment rolling update. Old pods are simply destroyed and replaced with new ones, and the pod template is the sole source of configuration for each replica. Rubix’s commitment to ephemeral compute brings these upgrade semantics to the K8s node level.
Consider a routine OS upgrade. This results in a new template for our cloud instances (whether an AWS launch template or a GCP instance template). From there, expediting upgrades simply involves adding a new termination policy that prioritizes instances derived from older templates. The cloud provider then uses the new template when replacing these terminated instances.
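Continuing the earlier policy sketch, a stale-template policy might look something like the following. The label key is a hypothetical placeholder, and in practice the template identity would need to be stamped onto each node at registration time.

```go
package policies

import corev1 "k8s.io/api/core/v1"

// staleTemplatePolicy plugs into the NodeSelectionPolicy interface sketched
// earlier: it selects nodes whose recorded template version differs from the
// desired one, so they are drained and replaced first during an upgrade.
type staleTemplatePolicy struct {
	labelKey       string // hypothetical label, e.g. "rubix.example.com/template-version"
	desiredVersion string
}

func (p staleTemplatePolicy) Name() string { return "stale-template" }

func (p staleTemplatePolicy) Selects(node *corev1.Node) bool {
	current, ok := node.Labels[p.labelKey]
	return !ok || current != p.desiredVersion
}
```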
This upgrade approach doesn’t cover all classes of instance upgrades. For example, some changes to launch configurations cause cloud providers to terminate instances unilaterally, bypassing the graceful termination logic described above.
We addressed this by again leveraging the concept of immutability, this time at the instance group level. By making certain instance group configuration parameters immutable, we could implement instance group upgrades by replacement.
Once again, extending our application of ephemerality and immutability of compute resources has proved valuable. Similar to Deployment-managed ReplicaSets, instance groups now have no configuration history. All of an instance group’s members are derived from a single, immutable template, making them easier to debug.
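To sketch what "upgrades by replacement" might look like in code, the snippet below derives an instance group's identity from a hash of its immutable parameters, in the same spirit as the pod-template hash a Deployment stamps onto its ReplicaSets. The field names and naming scheme are assumptions for illustration, not Rubix's actual schema.

```go
package instancegroups

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// GroupSpec holds the parameters treated as immutable for an instance group.
type GroupSpec struct {
	TemplateID       string // cloud launch/instance template
	AvailabilityZone string
	InstanceType     string
	Spot             bool
}

// hash derives a stable identity from the immutable parameters.
func (s GroupSpec) hash() string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%s|%t",
		s.TemplateID, s.AvailabilityZone, s.InstanceType, s.Spot)))
	return hex.EncodeToString(sum[:8])
}

// Reconcile decides whether the desired spec can reuse the current group or
// requires replacing it with a freshly named group.
func Reconcile(currentName string, current, desired GroupSpec) (groupName string, replace bool) {
	if current.hash() == desired.hash() {
		return currentName, false // nothing changed; keep the existing group
	}
	// Any change to an immutable parameter yields a new group; the old one is
	// drained and deleted once the new group is healthy.
	return fmt.Sprintf("workers-%s", desired.hash()), true
}
```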
Moreover, because upgrades roll out gradually between two distinct instance groups, we can evaluate each group separately. Tracking the health of each group lets us judge whether the new instance group’s configuration is correct and roll back if our telemetry detects issues. A failed upgrade, for example, shows up as an elevated launch failure rate, the metric we use for instance group health.
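A rough sketch of such a health gate follows; the thresholds and types are illustrative assumptions, not Rubix's actual telemetry pipeline.

```go
package rollout

// GroupHealth captures launch outcomes for an instance group over a window.
type GroupHealth struct {
	Launches       int
	LaunchFailures int
}

func (g GroupHealth) failureRate() float64 {
	if g.Launches == 0 {
		return 0
	}
	return float64(g.LaunchFailures) / float64(g.Launches)
}

// ShouldRollBack aborts an upgrade when the new group's launch failure rate
// is both high in absolute terms and clearly worse than the old group's.
func ShouldRollBack(oldGroup, newGroup GroupHealth) bool {
	const maxFailureRate = 0.2 // arbitrary illustrative threshold
	return newGroup.failureRate() > maxFailureRate &&
		newGroup.failureRate() > 2*oldGroup.failureRate()
}
```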
Our broad application of the concept of ephemeral compute brought us to a place where instance upgrades and rollbacks happen automatically. At this point, we asked ourselves: can we leverage this system to improve our resiliency to cloud capacity issues?
Cloud capacity issues can arise for any number of reasons, but they often fall along the dimensions of availability zones (AZs) or instance offerings (whether an actual instance type, such as one with a particular memory-to-CPU ratio, or a compute offering layered on top of instances, such as AWS EC2 Spot). What would it look like to treat this class of capacity-related traits as immutable?
The parameters most relevant to cloud capacity issues often have enumerable values; production environments typically span three AZs, and opting for spot instances is a binary decision. Breaking up our existing instance groups along these new dimensions thus yields a bounded number of instance groups to which we can route capacity based on instance group health.
While splitting out instance groups introduced new responsibilities (such as balancing scaling across AZs), it hardened our infrastructure by enabling our control plane to automatically route capacity to healthy instance groups when others experience cloud provider outages.
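The sketch below illustrates this kind of routing: scale-up capacity is placed only on healthy instance groups, preferring the least-loaded ones so capacity stays balanced across AZs. The types and the least-loaded heuristic are assumptions for illustration, not Rubix's actual control plane logic.

```go
package capacity

import "sort"

// GroupKey identifies an instance group by the capacity dimensions we split on.
type GroupKey struct {
	AvailabilityZone string
	Spot             bool
}

// GroupState is the control plane's view of one instance group.
type GroupState struct {
	Key     GroupKey
	Healthy bool // e.g. recent launch failure rate below threshold
	Nodes   int
}

// PlaceNodes spreads a scale-up request across healthy groups, preferring the
// least-loaded groups so capacity stays balanced across availability zones.
func PlaceNodes(groups []GroupState, toAdd int) map[GroupKey]int {
	healthy := make([]GroupState, 0, len(groups))
	for _, g := range groups {
		if g.Healthy {
			healthy = append(healthy, g)
		}
	}
	placement := make(map[GroupKey]int)
	if len(healthy) == 0 {
		return placement // nothing healthy to route to
	}
	for i := 0; i < toAdd; i++ {
		// Always add the next node to the currently smallest healthy group.
		sort.Slice(healthy, func(a, b int) bool { return healthy[a].Nodes < healthy[b].Nodes })
		healthy[0].Nodes++
		placement[healthy[0].Key]++
	}
	return placement
}
```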
Palantir treated Rubix as an opportunity to extend ephemeral compute, a concept with proven production value, beyond the pod level to the K8s node level. After tackling the concomitant challenges, we realized this effort’s benefits in terms of improved security, easier upgrades, and better resiliency to cloud provider outages.
In a future post, we’ll cover how this foundation of ephemeral compute infrastructure has enabled us to tackle and rein in cloud costs. Stay tuned!
Interested in helping build the Rubix platform and other mission-critical software at Palantir? Head to our careers page to learn more.