By Jacob Meyers and Rob Zienert
Temporal is a Durable Execution platform that allows you to write code “as if failures don’t exist”. It’s become increasingly critical to Netflix since its initial adoption in 2021, with users ranging from the operators of our Open Connect global CDN to our Live reliability teams now depending on Temporal to operate their business-critical services. In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the rate of deployments failing due to transient errors from 4% to 0.0001%.
Spinnaker is a multi-cloud continuous delivery platform that powers the vast majority of Netflix’s software deployments. It’s composed of several (mostly nautical-themed) microservices. Let’s double-click on two in particular to understand the problems we were facing that led us to adopt Temporal.
In case you’re completely new to Spinnaker: its fundamental tool for deployments is the Pipeline. A Pipeline is composed of a sequence of steps called Stages, which themselves can be decomposed into one or more Tasks, or other Stages. An example deployment pipeline for a production service may consist of these stages: Find Image -> Run Smoke Tests -> Run Canary -> Deploy to us-east-2 -> Wait -> Deploy to us-east-1.
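To make that structure concrete, here is a rough Kotlin sketch of the Pipeline/Stage/Task hierarchy; the type and field names are illustrative, not Spinnaker’s actual domain model.

// Illustrative sketch only; these are not Spinnaker's actual domain types.
data class Task(val name: String)

data class Stage(
    val type: String,                          // e.g. "findImage", "deploy", "wait"
    val tasks: List<Task> = emptyList(),       // the units of work this Stage runs
    val stages: List<Stage> = emptyList(),     // Stages can decompose into other Stages
    val dependsOn: List<String> = emptyList()  // which Stages must complete first
)

data class Pipeline(
    val application: String,
    val stages: List<Stage>                    // e.g. Find Image -> Run Canary -> Deploy
)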
Pipeline configuration is extremely flexible. You can have Stages run completely serially, one after another, or you can have a mix of concurrent and serial Stages. Stages can also be executed conditionally based on the result of previous stages. This brings us to our first Spinnaker service: Orca. Orca is the orca-stration engine of Spinnaker. It’s responsible for managing the execution of the Stages and Tasks that a Pipeline unrolls into and coordinating with other Spinnaker services to actually execute them.
One of those collaborating services is called Clouddriver. In the example Pipeline above, some of the Stages will require interfacing with cloud infrastructure. For example, the canary deployment involves creating ephemeral hosts to run an experiment, and a full deployment of a new version of the service may involve spinning up new servers and then tearing down the old ones. We call these sorts of operations that mutate cloud infrastructure Cloud Operations. Clouddriver’s job is to decompose and execute Cloud Operations sent to it by Orca as part of a deployment. Cloud Operations sent from Orca to Clouddriver are relatively high level (for example: createServerGroup); Clouddriver knows how to translate them into lower-level cloud provider API calls.
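As a rough sketch of that decomposition, loosely modeled on the AtomicOperation pattern in OSS Clouddriver (the names and signatures here are simplified, not the exact production code):

// Simplified sketch, loosely modeled on OSS Clouddriver's AtomicOperation pattern;
// names and signatures are illustrative, not the exact production code.

// Each AtomicOperation wraps the lower-level cloud provider API calls needed for
// one piece of a high-level Cloud Operation.
interface AtomicOperation<R> {
    fun operate(priorOutputs: List<Any?>): R
}

// Orca sends a high-level operation (e.g. createServerGroup) as a loosely typed map;
// Clouddriver converts it into one or more AtomicOperations and executes them in order.
interface AtomicOperationConverter {
    fun convert(description: Map<String, Any?>): AtomicOperation<*>
}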
Pain points in the interaction between Orca and Clouddriver, and in how Clouddriver executes Cloud Operations, are what led us to look for new solutions and ultimately migrate to Temporal, so we’ll look next at the anatomy of a Cloud Operation. Cloud Operations in the OSS version of Spinnaker still work as described below, so motivated readers can follow along in the source code. Our migration to Temporal, however, is entirely closed-source: we forked from OSS in 2020 to give Netflix room to make larger pivots to the product, such as this one.
A Cloud Operation’s execution goes something like this:
This works well enough on the happy path, but veer off the happy path and dragons begin to emerge:
We introduced a lot of incidental complexity into Clouddriver in an effort to keep Cloud Operation execution reliable, and despite all this, deployments still failed around 4% of the time due to transient Cloud Operation failures.
Now, I can already hear you saying: “So what? Can’t people re-try their deployments if they fail?” While true, some pipelines take days to complete for complex deployments, and a failed Cloud Operation mid-way through requires re-running the whole thing. This was detrimental to engineering productivity at Netflix in a non-trivial way. Rather than continue trying to build a faster horse, we began to look elsewhere for our reliable orchestration requirements, which is where Temporal comes in.
Temporal is an open source product that offers a durable execution platform for your applications. Durable execution means that the platform will ensure your programs run to completion despite adverse conditions. With Temporal, you organize your business logic into Workflows, which are a deterministic series of steps. The steps inside of Workflows are called Activities, which is where you encapsulate all your non-deterministic logic that needs to happen in the course of executing your Workflows. As your Workflows execute in processes called Workers, the Temporal server durably stores their execution state so that in the event of failures your Workflows can be retried or even migrated to a different Worker. This makes Workflows incredibly resilient to the sorts of transient failures Clouddriver was susceptible to. Here’s a simple example Workflow in Java that runs an Activity to send an email once every 30 days:
// Imports from the Temporal Java SDK (in practice, each type below lives in its own file).
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@WorkflowInterface
public interface SleepForDaysWorkflow {
  @WorkflowMethod
  void run();
}

public class SleepForDaysWorkflowImpl implements SleepForDaysWorkflow {
  private final SendEmailActivities emailActivities =
      Workflow.newActivityStub(
          SendEmailActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(10))
              .build());

  @Override
  public void run() {
    while (true) {
      // Activities already carry retries/timeouts via options.
      emailActivities.sendEmail();
      // Pause the workflow for 30 days before sending the next email.
      Workflow.sleep(Duration.ofDays(30));
    }
  }
}

@ActivityInterface
public interface SendEmailActivities {
  void sendEmail();
}
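To round out the picture, here is a minimal Kotlin sketch of how a Workflow like this gets hosted on a Worker and started through a Temporal client, using the same Java SDK; the task queue name, Workflow ID, and the SendEmailActivitiesImpl stub are placeholders.

import io.temporal.client.WorkflowClient
import io.temporal.client.WorkflowOptions
import io.temporal.serviceclient.WorkflowServiceStubs
import io.temporal.worker.WorkerFactory
import io.temporal.workflow.Functions

// Placeholder Activity implementation; in reality this would call a mail service.
class SendEmailActivitiesImpl : SendEmailActivities {
    override fun sendEmail() = println("email sent")
}

fun main() {
    // Connect to a locally running Temporal server.
    val service = WorkflowServiceStubs.newLocalServiceStubs()
    val client = WorkflowClient.newInstance(service)

    // A Worker polls a task queue and executes the Workflow and Activity code.
    val factory = WorkerFactory.newInstance(client)
    val worker = factory.newWorker("sleep-for-days-queue") // placeholder task queue name
    worker.registerWorkflowImplementationTypes(SleepForDaysWorkflowImpl::class.java)
    worker.registerActivitiesImplementations(SendEmailActivitiesImpl())
    factory.start()

    // Start a Workflow execution asynchronously (this Workflow loops forever);
    // Temporal durably tracks its state from here on.
    val workflow = client.newWorkflowStub(
        SleepForDaysWorkflow::class.java,
        WorkflowOptions.newBuilder()
            .setTaskQueue("sleep-for-days-queue")
            .setWorkflowId("send-email-every-30-days") // placeholder workflow id
            .build()
    )
    WorkflowClient.start(Functions.Proc { workflow.run() })
}

Once started, the Worker process can crash and restart freely; Temporal replays the Workflow from its durably stored history.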
There are some interesting things to note about this Workflow:
You might already begin to see how Temporal solves a lot of the problems we had with Clouddriver. Ultimately, we decided to pull the trigger on migrating Cloud Operation execution to Temporal.
Today, we execute Cloud Operations as Temporal workflows. Here’s what that looks like.
1. Orca uses its Temporal Client to start an UntypedCloudOperationRunner Workflow execution, passing along the Stage’s context and the type of Cloud Operation to run:
@WorkflowInterface
interface UntypedCloudOperationRunner {
    /**
     * Runs a cloud operation given an untyped payload.
     *
     * WorkflowResult is a thin wrapper around OutputType providing a standard contract for
     * clients to determine whether the CloudOperation was successful and to fetch any errors.
     */
    @WorkflowMethod
    fun <OutputType : CloudOperationOutput> run(
        stageContext: Map<String, Any?>,
        operationType: String
    ): WorkflowResult<OutputType>
}
2. The Clouddriver Temporal worker is constantly polling Temporal for work. A worker will eventually see a task for an UntypedCloudOperationRunner Workflow and start executing it.
3. As with the legacy resolution into AtomicOperations, Clouddriver pre-processes the bag of fields in stageContext and, based on the operationType input and the stageContext itself, resolves it to a strongly typed implementation of the CloudOperation Workflow interface:
interface CloudOperation<I : CloudOperationInput, O : CloudOperationOutput> {
    @WorkflowMethod
    fun operate(input: I, credentials: AccountCredentials<out Any>): O
}
4. Clouddriver starts a Child Workflow execution of the CloudOperation implementation it resolved. The Child Workflow executes Activities that handle the actual cloud provider API calls to mutate infrastructure (see the sketch after this list).
5. Orca uses its Temporal Client to await completion of the UntypedCloudOperationRunner Workflow. Once it’s complete, Temporal notifies the client with the result, and Orca can continue progressing the deployment.
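To make steps 3 and 4 a bit more concrete, here is a hedged Kotlin sketch that builds on the interfaces above; the createServerGroup types, the helper functions, and WorkflowResult.success are illustrative placeholders, not our actual implementation.

import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface

// Hypothetical strongly typed operation, resolved from operationType = "createServerGroup".
@WorkflowInterface
interface CreateServerGroupOperation : CloudOperation<CreateServerGroupInput, CreateServerGroupOutput>

class UntypedCloudOperationRunnerImpl : UntypedCloudOperationRunner {

    override fun <OutputType : CloudOperationOutput> run(
        stageContext: Map<String, Any?>,
        operationType: String
    ): WorkflowResult<OutputType> {
        // Step 3: pre-process the loosely typed stageContext into strongly typed input
        // and the credentials for the target account (placeholder helpers, not shown).
        val input = parseCreateServerGroupInput(stageContext)
        val credentials = resolveCredentials(stageContext)

        // Step 4: run the resolved CloudOperation as a Child Workflow. Its Activities
        // make the actual cloud provider API calls that mutate infrastructure.
        val child = Workflow.newChildWorkflowStub(CreateServerGroupOperation::class.java)
        val output = child.operate(input, credentials)

        @Suppress("UNCHECKED_CAST")
        return WorkflowResult.success(output) as WorkflowResult<OutputType>
    }
}

As noted in the lessons below, this Child Workflow indirection turned out to be unnecessary; composition inside the parent Workflow would have achieved the same thing.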
A shiny new architecture is great, but equally important is the non-glamorous work of refactoring legacy systems to fit it. How did we transparently slot Temporal underneath a system that every Netflix engineer depends on?
The answer, of course, is a combination of abstraction and dynamic configuration. We built a CloudOperationRunner interface in Orca to encapsulate whether the Cloud Operation was being executed via the legacy path or Temporal. At runtime, Fast Properties (Netflix’s dynamic configuration system) determined which path a stage that needed to execute a Cloud Operation would take. We could set these properties quite granularly — by Stage type, cloud provider account, Spinnaker application, Cloud Operation type (createServerGroup), and cloud provider (either AWS or Titus in our case). The Spinnaker services themselves were the first to be deployed using Temporal, and within two quarters, all applications at Netflix were onboarded.
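A hedged sketch of that abstraction follows; the interface shape, the property name, and the DynamicConfig accessor are illustrative placeholders, not Orca’s actual code.

// Illustrative sketch; interface shape, property name, and config accessor are
// placeholders rather than Orca's actual code.

// Placeholder for Netflix's dynamic configuration (Fast Properties) client.
interface DynamicConfig {
    fun isEnabled(property: String, vararg scope: String): Boolean
}

// Orca's abstraction over "how do I execute this Cloud Operation?"
interface CloudOperationRunner {
    fun run(stageContext: Map<String, Any?>, operationType: String)
}

class LegacyCloudOperationRunner : CloudOperationRunner {
    override fun run(stageContext: Map<String, Any?>, operationType: String) {
        // submit the operation to Clouddriver over the pre-Temporal path
    }
}

class TemporalCloudOperationRunner : CloudOperationRunner {
    override fun run(stageContext: Map<String, Any?>, operationType: String) {
        // start an UntypedCloudOperationRunner Workflow via the Temporal Client
    }
}

// Fast Properties pick the path per Stage, scoped by Stage type, account,
// application, Cloud Operation type, and cloud provider.
class CloudOperationRunnerSelector(
    private val legacy: LegacyCloudOperationRunner,
    private val temporal: TemporalCloudOperationRunner,
    private val config: DynamicConfig
) {
    fun select(stageType: String, account: String, application: String,
               operationType: String, provider: String): CloudOperationRunner =
        if (config.isEnabled("cloudops.temporal.enabled", // placeholder property name
                stageType, account, application, operationType, provider)) temporal else legacy
}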
What did we have to show for it all? With Temporal as the orchestration engine for Cloud Operations, the percentage of deployments that failed due to transient Cloud Operation failures dropped from 4% to 0.0001%. For those keeping track at home, that’s a reduction of roughly four and a half orders of magnitude. Virtually eliminating this failure mode for deployments was a huge win for developer productivity, especially for teams with long and complex deployment pipelines.
Beyond the improvement in deployment success metrics, we saw a number of other benefits:
With the benefit of hindsight, there are also some lessons we can share from this migration:
1. Avoid unnecessary Child Workflows: Structuring Cloud Operations as an UntypedCloudOperationRunner Workflow that starts Child Workflows to actually execute the Cloud Operation’s logic was unnecessary and the indirection made troubleshooting more difficult. There are situations where Child Workflows are appropriate, but in this case we were using them as a tool for code organization, which is generally unnecessary. We could’ve achieved the same effect with class composition in the top-level parent Workflow.
2. Use single argument objects: At first, we structured Workflow and Activity functions with variable arguments, much as you’d write normal functions. This can be problematic because of Temporal’s determinism constraints: adding or removing an argument from a function signature is not a backward-compatible change, it can break long-running workflows, and it’s not immediately obvious in code review that such a change is problematic. The preferred pattern is to use a single serializable class to hold all the arguments for a Workflow or Activity; these can be changed much more freely without breaking determinism (see the sketch after this list).
3. Separate business failures from workflow failures: We like the pattern of the WorkflowResult type that UntypedCloudOperationRunner returns in the interface above. It allows us to communicate business-process failures without failing the Workflow itself, and to handle errors with more nuance. This is a pattern we’ve carried over to other Workflows we’ve implemented since.
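To illustrate lessons 2 and 3, here is a hedged sketch; the field names and the exact shape of WorkflowResult are illustrative rather than our production types.

// Illustrative sketch of lessons 2 and 3; the field names and the exact shape of
// WorkflowResult are placeholders rather than our production types.

// Lesson 2: a single serializable argument object. New fields with defaults can be
// added later without changing the Workflow method's signature.
data class CreateServerGroupRequest(
    val application: String,
    val account: String,
    val region: String,
    val desiredCapacity: Int = 1 // added later without breaking in-flight Workflows
)

// Lesson 3: report business-process failures as data in the result, so the Workflow
// itself can still complete successfully and callers get nuanced error handling.
data class WorkflowResult<T>(
    val output: T? = null,
    val errors: List<String> = emptyList()
) {
    val succeeded: Boolean get() = errors.isEmpty()

    companion object {
        fun <T> success(output: T): WorkflowResult<T> = WorkflowResult(output = output)
        fun <T> failure(vararg errors: String): WorkflowResult<T> = WorkflowResult(errors = errors.toList())
    }
}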
Temporal adoption has skyrocketed at Netflix since its initial introduction for Spinnaker. Today, we have hundreds of use cases, and we’ve seen adoption double in the last year with no signs of slowing down.
One major difference between initial adoption and today is that Netflix migrated from an on-prem Temporal deployment to using Temporal Cloud, which is Temporal’s SaaS offering of the Temporal server. This has let us scale Temporal adoption while running a lean team. We’ve also built up a robust internal platform around Temporal Cloud to integrate with Netflix’s internal ecosystem and make onboarding for our developers as easy as possible. Stay tuned for a future post digging into more specifics of our Netflix Temporal platform.
We all stand on the shoulders of giants in software. I want to call out that I’m retelling the work of my two stunning colleagues Chris Smalley and Rob Zienert in this post; they were the engineers who introduced Temporal to Netflix and carried out the migration described here.
How Temporal Powers Reliable Cloud Operations at Netflix was originally published in Netflix TechBlog on Medium.