In large-scale distributed systems, transient failures are an inevitable reality. At Netflix, which deploys thousands of microservices globally, Spinnaker has been the core continuous delivery (CD) engine. However, limitations in Spinnaker's Clouddriver service—particularly its complex internal orchestration logic and state management—led to approximately 4% of deployments failing due to Cloud Operation issues. This post explores Netflix's journey of adopting Temporal, a Durable Execution platform, to fundamentally solve this problem and dramatically reduce deployment failure rates from 4% to 0.0001%. For the original source, refer to the Netflix Tech Blog article.

The Achilles' Heel of Spinnaker's Clouddriver

Within Spinnaker, Clouddriver was responsible for executing cloud infrastructure mutations (e.g., creating/deleting server groups). When Orca (the orchestration engine) requested work from Clouddriver, Clouddriver had a complex internal process to decompose and execute these as lower-level AtomicOperations. This architecture had several fundamental flaws:

  1. Complex Internal Orchestration: Clouddriver had to implement its own logic for task state tracking, retries, and rollbacks (a Saga pattern). This was "undifferentiated heavy lifting": work unrelated to its core goal of executing infrastructure changes.
  2. Instance-Local State: Task execution state was stored in the memory of a specific Clouddriver instance. If that instance went down, the in-progress task state was completely lost.
  3. Tight Coupling: Orca needed intimate knowledge of Clouddriver's specific status polling API and error handling patterns.

Despite this complexity, providing robust handling for transient failures—like network flakiness or cloud provider outages—remained challenging, ultimately manifesting as a 4% deployment failure rate.
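To make the idea of transient-failure handling concrete, here is a minimal sketch of retrying with exponential backoff. It is a generic illustration of the technique, not Clouddriver's or Temporal's actual implementation; `TransientError`, `retry`, and `flaky_create_server_group` are hypothetical names invented for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


def retry(operation, max_attempts=5, initial_delay=0.1, backoff=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # grow the wait each time


# Usage: a simulated cloud call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_create_server_group():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated network flakiness")
    return "server-group-created"


# sleep is stubbed out so the example runs instantly.
result = retry(flaky_create_server_group, sleep=lambda _: None)
```

Note that the attempt counter here lives in process memory, so it shares the instance-local state problem described above: if the process dies mid-retry, the progress is lost.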

The Paradigm Shift with Temporal: Coding 'As If Failures Don't Exist'

Temporal provides Durable Execution. Developers structure their business logic into Workflows (a deterministic series of steps) and Activities (non-deterministic external work, e.g., API calls). The Temporal server then durably stores and manages execution state. If a Worker process dies, even while a Workflow is in the middle of a 30-day sleep, Temporal preserves the state and resumes execution on another Worker.
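The following toy model sketches one way durable execution can work: each Activity result is recorded in a history, and a restarted Workflow replays recorded results instead of re-running the side effects. This is a simplified illustration under stated assumptions, not Temporal's SDK or internals; `DurableHistory`, `WorkflowContext`, and the activity functions are all hypothetical, and the history is an in-memory list standing in for a durable store.

```python
class DurableHistory:
    """Stand-in for the durable store: an ordered list of activity results."""

    def __init__(self):
        self.events = []


class WorkerCrash(Exception):
    """Simulates a Worker process dying mid-workflow."""


class WorkflowContext:
    def __init__(self, history):
        self.history = history
        self.position = 0  # index of the next step within this run

    def execute_activity(self, activity, *args):
        if self.position < len(self.history.events):
            # Step already completed in an earlier run: replay the result.
            result = self.history.events[self.position]
        else:
            # New step: run the side effect and record its result.
            result = activity(*args)
            self.history.events.append(result)
        self.position += 1
        return result


side_effects = []  # records which "cloud calls" actually happened


def create_server_group(name):
    side_effects.append(f"create {name}")
    return f"{name}-v001"


def enable_traffic(group):
    side_effects.append(f"enable {group}")
    return f"{group} enabled"


def deploy_workflow(ctx, crash_after_first_step=False):
    group = ctx.execute_activity(create_server_group, "api")
    if crash_after_first_step:
        raise WorkerCrash()
    return ctx.execute_activity(enable_traffic, group)


history = DurableHistory()

# First Worker crashes after the first activity completes.
try:
    deploy_workflow(WorkflowContext(history), crash_after_first_step=True)
except WorkerCrash:
    pass

# A second Worker resumes from the same history; step one is replayed, not re-run.
final = deploy_workflow(WorkflowContext(history))
```

The replay only works because the workflow function is deterministic: given the same recorded results, it takes the same path, which is why non-deterministic work is pushed into Activities.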

Post-Migration Architecture with Temporal:

  1. Instead of making direct requests to Clouddriver, Orca uses a Temporal client to start an UntypedCloudOperationRunner Workflow.
  2. A Clouddriver Worker picks up the Workflow task, interprets the payload, and executes the appropriate CloudOperation Child Workflow.
  3. The Child Workflow executes Activities that make the actual cloud provider API calls.
  4. Orca asynchronously awaits Workflow completion via its Temporal client.

This shift meant Clouddriver no longer had to implement complex orchestration, state management, or retry logic itself—all of that became the responsibility of the Temporal platform.
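The four steps above can be sketched as a small in-process simulation. Only the names UntypedCloudOperationRunner and CloudOperation come from the original post; the payload shape, the operation type strings, and the queue and result dictionaries are assumptions made for illustration, and no real Temporal client is involved.

```python
from collections import deque

task_queue = deque()  # stand-in for a Temporal task queue
results = {}          # stand-in for durable workflow results, keyed by workflow id


# Activities: the only place that would talk to the cloud provider.
def create_server_group_activity(params):
    return f"created {params['name']}"


def delete_server_group_activity(params):
    return f"deleted {params['name']}"


# "Child workflows": one per cloud operation type (type names are hypothetical).
def create_server_group_operation(params):
    return create_server_group_activity(params)


def delete_server_group_operation(params):
    return delete_server_group_activity(params)


OPERATIONS = {
    "createServerGroup": create_server_group_operation,
    "deleteServerGroup": delete_server_group_operation,
}


def untyped_cloud_operation_runner(payload):
    """Step 2: interpret the untyped payload and run the matching operation."""
    operation = OPERATIONS[payload["type"]]
    return operation(payload["params"])  # step 3 happens inside the operation


# Orca side (steps 1 and 4).
def orca_start_workflow(workflow_id, payload):
    task_queue.append((workflow_id, payload))
    return workflow_id


def orca_await_result(workflow_id):
    return results[workflow_id]


# Clouddriver Worker side.
def clouddriver_worker_poll_once():
    workflow_id, payload = task_queue.popleft()
    results[workflow_id] = untyped_cloud_operation_runner(payload)


handle = orca_start_workflow(
    "deploy-1", {"type": "createServerGroup", "params": {"name": "api-v001"}}
)
clouddriver_worker_poll_once()
outcome = orca_await_result(handle)
```

In this shape Orca only knows how to start a workflow and read its result, which is the loose coupling the migration was after.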

Results and Key Learnings

Key Outcomes:

  • Deployment Failure Rate: 4% → 0.0001% (an improvement of roughly 40,000x).
  • Reduced Coupling: Orca and Clouddriver became loosely coupled via Temporal as an intermediary.
  • Statelessness: Clouddriver instances became stateless, so they can be treated like cattle rather than pets and can tolerate random instance termination by tools like Chaos Monkey.
  • Improved Observability: The Temporal UI made it significantly easier to debug and monitor Workflows executing in production in real-time.
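As a quick check of the failure-rate figure in the first outcome, the arithmetic can be worked through with exact fractions. The per-million framing is added here for illustration only and does not come from the original post.

```python
from fractions import Fraction

before = Fraction(4, 100)       # 4% failure rate
after = Fraction(1, 1_000_000)  # 0.0001% = 0.0001 / 100 = one in a million

improvement = before / after    # how many times lower the new rate is

# Expected failures per million deployments, before and after.
failures_per_million_before = before * 1_000_000
failures_per_million_after = after * 1_000_000
```

The ratio works out to 40,000, matching the figure quoted above: about 40,000 failures per million deployments before, and about one per million after.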

Practical Lessons Learned:

  1. Avoid Unnecessary Child Workflows: Using Child Workflows purely for code organization can complicate debugging. Consider class composition instead.
  2. Use Single Argument Objects: Changing a Workflow/Activity method signature can break long-running Workflows. Using a single serializable class to hold all arguments is a safer pattern.
  3. Separate Business Failures from Workflow Failures: Using a wrapper type like WorkflowResult to communicate business logic failures separately from Workflow execution failures allows for more nuanced error handling.
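Lessons 2 and 3 can be sketched together. The name WorkflowResult comes from the original post, but its fields here are assumptions, and `ResizeServerGroupInput` and `resize_server_group` are hypothetical examples rather than Netflix's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResizeServerGroupInput:
    """Single argument object: new fields can be added with defaults
    without changing the workflow's method signature."""

    server_group: str
    desired_capacity: int
    region: str = "us-east-1"  # imagine this was added later; old callers still work


@dataclass(frozen=True)
class WorkflowResult:
    """Wrapper that reports business outcomes separately from execution failures."""

    success: bool
    value: Optional[str] = None
    error: Optional[str] = None


def resize_server_group(args: ResizeServerGroupInput) -> WorkflowResult:
    if args.desired_capacity < 0:
        # Business failure: the workflow ran fine, but the request was invalid.
        return WorkflowResult(success=False, error="desired_capacity must be >= 0")
    return WorkflowResult(
        success=True,
        value=f"{args.server_group} resized to {args.desired_capacity} in {args.region}",
    )


ok = resize_server_group(ResizeServerGroupInput("api-v001", 3))
rejected = resize_server_group(ResizeServerGroupInput("api-v001", -1))
```

With this shape, an exception escaping the workflow can be read as an execution problem, while `success=False` signals a business-level rejection the caller can handle differently.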

This successful migration catalyzed widespread adoption of Temporal at Netflix, which now leverages Temporal Cloud (SaaS) for hundreds of diverse use cases. It stands as a valuable case study for any architect designing distributed systems that must achieve reliability amidst complexity.