Production Infrastructure at Palantir: User-minded API design

Intro

It’s trendy right now for organizations to describe themselves as “outcome-oriented,” where members of the organization are accountable for the results of their work. The Production Infrastructure group at Palantir commits itself to being oriented around our users’ outcomes — but how does this translate into building good software?

Our users’ outcomes are often dictated by the complexity of the APIs we provide. For example, Kubernetes’ robust and generalized APIs require a wealth of experience to leverage effectively. When designing features, it’s important to remain wary of our tendency as engineers to build the more general, “powerful” API at the expense of usability. This post discusses Palantir’s commitment to focusing engineering efforts on enabling user outcomes. We look at a small example of how this influences feature planning and API design within Palantir’s Production Infrastructure group.

Problem Statement

Feature requests are often specific to particular user outcomes. They also trickle in over time; the idea that the first generally available version of a piece of software is also feature-complete strikes engineers as absurd. Thus, product development teams are rightfully wary of “overfitting” features or APIs to the feature request in front of them. To avoid this pitfall of writing bespoke, non-reusable features, developers strive to implement more general, flexible solutions. This sounds great in principle, but it has limitations:

  • Generalizing a user’s problem statement can turn it into a problem the user doesn’t actually have
  • Making APIs too powerful or flexible forces users to build their own expertise just to use them correctly

Knowing these challenges, how can we avoid the pitfall of shipping overly generalized features that don’t serve our users?

Hearing Users

Processes within Palantir Production Infrastructure (PI) help ensure we are designing software that meets our users’ actual needs. We begin each set of feature work with a “request for comments,” or RFC. Each of these RFCs includes both a problem statement and a proposed solution. We encourage RFC reviewers to scrutinize the problem statement just as much as they do the proposed solution. This scrutiny during review helps answer a few questions critical to the sustainability of a design decision:

  • Are we accurately distilling and synthesizing our users’ needs?
  • If we’re generalizing users’ needs, does this saddle them with too much of a burden for correct usage?
  • Does a generalized problem statement expose concepts that are best kept internal?

Reasoning about this is difficult without a concrete example in front of us. Since PI also believes our users’ outcomes depend on the sum of myriad small design decisions, we’ll zoom in on a recent engineering endeavor. Specifically, we’ll look at how that endeavor arrived at an API expressive enough to satisfy the user’s needs without burdening the user with too much complexity.

Endeavor: Spark Application Scheduling Improvements

PI’s effort earlier this year to improve its scheduling of Spark application pods in Kubernetes clusters provides an opportunity to look at systematic API design. We have some prior art in this space: building on our management of Kubernetes clusters composed of ephemeral compute, we already gang schedule a Spark application’s pods in a first-in, first-out manner and allocate compute nodes only in healthy availability zones (AZs).

Building on this foundation, we wanted to enable scheduling all the pods of a given Spark application in a single availability zone. Our motivation was straightforward: a Spark application’s pods transmit large amounts of data to one another, and colocating all of an application’s pods within a single AZ would eliminate substantial cross-AZ data transfer costs.

Our proposed solution would leverage existing APIs between our Spark scheduler and our cluster autoscaling components (we highly recommend reading our post on Spark scheduling in Kubernetes before continuing!):

  • ResourceReservations, which reserve resources on compute nodes for the sake of gang scheduling a Spark application’s pods
  • Demands, which represent unmet compute resource requirements (both resources are sketched below)
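To keep the rest of the discussion concrete, here is a minimal sketch of what these two custom resources might look like as Go types. The field names and shapes below are simplified assumptions for the purposes of this post, not the exact production specs.

```go
package sparkscheduler

// Illustrative, simplified shapes for the two custom resources above.
// Field names and types are assumptions for this post, not the exact
// production specs.

// Reservation pins a single pod's resources to a specific node.
type Reservation struct {
	Node   string `json:"node"`   // node the resources are reserved on
	CPU    string `json:"cpu"`    // e.g. "2"
	Memory string `json:"memory"` // e.g. "4Gi"
}

// ResourceReservationSpec reserves resources on compute nodes so an
// entire Spark application (driver plus executors) can be gang scheduled.
type ResourceReservationSpec struct {
	// Reservations maps each pod in the application (e.g. "driver",
	// "executor-1") to the node and resources reserved for it.
	Reservations map[string]Reservation `json:"reservations"`
}

// DemandUnit is one resource "shape" that is still needed, e.g. three
// executors of 2 CPU / 4Gi each.
type DemandUnit struct {
	Count  int    `json:"count"`
	CPU    string `json:"cpu"`
	Memory string `json:"memory"`
}

// DemandSpec describes compute the cluster cannot currently provide and
// asks the autoscaling components to provision it.
type DemandSpec struct {
	Units []DemandUnit `json:"units"`
}
```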

For scheduling applications entirely within a single zone, our Spark scheduler could use K8s labels to list available nodes by zone and then invoke its existing bin-packing logic once per zone:

Before: The scheduler considers nodes across all zones when bin-packing.
After: Since the nodes provided to each bin-packing call are all in the same zone, the ResourceReservation resulting from successful bin-packing would include nodes from just one zone.
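In terms of the illustrative types above, the per-zone loop might look roughly like this. Node, Application, and binPack are hypothetical stand-ins for the scheduler’s existing machinery, and we assume nodes carry the standard topology.kubernetes.io/zone label:

```go
// Node and Application are minimal stand-ins for the scheduler's real types.
type Node struct {
	Name   string
	Labels map[string]string
}

type Application struct{ Name string }

// binPack represents the scheduler's existing bin-packing logic, unchanged:
// it either fits the whole application onto the given nodes or reports failure.
func binPack(app Application, nodes []Node) (*ResourceReservationSpec, bool) {
	// ... existing logic elided ...
	return nil, false
}

// scheduleSingleZone runs the existing bin-packing once per availability
// zone instead of once across the whole cluster.
func scheduleSingleZone(app Application, nodes []Node) (*ResourceReservationSpec, bool) {
	// Group candidate nodes by their availability zone label.
	byZone := map[string][]Node{}
	for _, n := range nodes {
		zone := n.Labels["topology.kubernetes.io/zone"]
		byZone[zone] = append(byZone[zone], n)
	}

	// The first zone with enough spare capacity wins; because each call
	// only sees one zone's nodes, the resulting reservation can only
	// reference nodes from that zone.
	for _, zoneNodes := range byZone {
		if reservation, ok := binPack(app, zoneNodes); ok {
			return reservation, true
		}
	}

	// No zone can currently fit the application; the caller falls back
	// to creating a Demand, as discussed next.
	return nil, false
}
```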

This approach addresses the case where at least one AZ already has enough spare compute capacity to fit the application, but what about when that isn’t the case? The Spark scheduler has used Demands to describe such unmet resource requirements, but our autoscaling components must now satisfy a given Demand within a single zone. Thus, when running in “single zone” scheduling mode, our scheduler needed a way to express this zonal constraint on Demands. What was the best approach to extending our Demand API for this specific use case?

Weighing the Options

There are a couple of API options that would let the scheduler request that a Demand be met within just one AZ:

  • Add a string field Zone to the Demand spec: the scheduler would use this field to specify the zone in which it wants the Demand met
  • Add a boolean field EnforceSingleZoneScheduling to the Demand spec: the scheduler would set this to true when it wants the Demand met within a single zone (both options are sketched below)
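Expressed against the illustrative DemandSpec from earlier, the two candidates amount to roughly the following. The struct names are ours, purely so the two variants can sit side by side:

```go
// Option 1: the scheduler names the zone itself.
type DemandSpecWithZone struct {
	Units []DemandUnit `json:"units"`
	// Zone is the AZ in which this Demand must be satisfied;
	// the scheduler has to choose it.
	Zone string `json:"zone,omitempty"`
}

// Option 2: the scheduler only states the constraint; the autoscaling
// components pick whichever single AZ can satisfy the Demand.
type DemandSpecWithSingleZoneFlag struct {
	Units                       []DemandUnit `json:"units"`
	EnforceSingleZoneScheduling bool         `json:"enforceSingleZoneScheduling,omitempty"`
}
```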

The first option here is tempting; it’s more “powerful” regarding the control it gives our scheduler over scheduling decisions. However, it has some significant downsides.

The scheduler becomes responsible for picking a zone for each Spark application; how can it do this reliably? What additional information does it need to make this decision? It turns out that it needs information about the health and scalability of each AZ, information previously kept behind APIs internal to our autoscaling components. At best, leaking this information requires our user (the scheduler) to duplicate logic already encoded in our autoscaling components; at worst, the scheduler implements a subtly wrong version of this logic that causes production outages in the form of persistent scheduling failures. Either way, we’ve exposed a previously internal API to the scheduler, and this exposure introduces dependencies that constrain future development of the autoscaling components.

This API design pitfall derives from misinterpreting the specific goal of our user (the scheduler): obtain compute sufficient to schedule an application entirely within one zone, whichever zone that turns out to be. Our desire to build a more generalized, “powerful” API leads to worse outcomes for our user and for the maintainers of cluster autoscaling.

The second option, our EnforceSingleZoneScheduling boolean, avoids these downsides while providing just enough expressiveness to enable our user’s desired outcome. Our cluster autoscaling components keep more information abstracted away from the scheduler, protecting flexibility in future development efforts.
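To make the division of responsibility concrete, here is a hedged sketch of the autoscaling side honoring the flag. healthyZones, planCapacity, and planCapacityInZone are hypothetical internal helpers; the point is simply that zone health and zone choice never leave the autoscaler:

```go
// Hypothetical internal helpers; bodies elided. They encapsulate AZ health,
// scalability, and capacity planning entirely inside the autoscaler.
func healthyZones() []string                                  { /* ... */ return nil }
func planCapacity(units []DemandUnit) bool                    { /* ... */ return false }
func planCapacityInZone(zone string, units []DemandUnit) bool { /* ... */ return false }

// satisfyDemand sketches the autoscaler's side of the contract.
func satisfyDemand(d DemandSpecWithSingleZoneFlag) bool {
	if !d.EnforceSingleZoneScheduling {
		// Existing behavior: capacity may be added across any zones.
		return planCapacity(d.Units)
	}
	// Single-zone mode: try each healthy, scalable zone in turn and
	// provision the whole Demand in the first one that can take it.
	for _, zone := range healthyZones() {
		if planCapacityInZone(zone, d.Units) {
			return true
		}
	}
	// No single zone can take the whole Demand right now; it stays
	// unmet and is retried on the next reconcile loop.
	return false
}
```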

This decision seems simple in hindsight, but it stems from scrutinizing and distilling the user’s desired outcome and building a solution for that outcome.

Recap

As software engineers, we naturally tend towards “generalized” or “powerful” APIs and often conflate these descriptors with user success. There are cases where this is detrimental to both the user and the developer: the user is saddled with a more significant knowledge burden, and developers leak abstractions to the detriment of their future productivity. The real merit of an API is how effectively it enables its users’ outcomes. As we see above, PI at Palantir aims to distill and address users’ desired outcomes at all scales, even in the smallest instances.

