Thursday, December 8, 2011

Viz: Making the Netflix API More Resilient

How netflix configures its services, and the visualizations it uses to monitor them. 

Making the Netflix API More Resilient

by Ben Schmaus

The API brokers catalog and subscriber metadata between internal services and Netflix applications on hundreds of device types. If any of these internal services fail there is a risk that the failure could propagate to the API and break the user experience for members.

To provide the best possible streaming experience for our members, it is critical for us to keep the API online and serving traffic at all times. Maintaining high availability and resiliency for a system that handles a billion requests a day is one of the goals of the API team, and we have made great progress toward achieving this goal over the last few months.

Principles of Resiliency

Here are some of the key principles that informed our thinking as we set out to make the API more resilient.

  1. A failure in a service dependency should not break the user experience for members
  2. The API should automatically take corrective action when one of its service dependencies fails
  3. The API should be able to show us what’s happening right now, in addition to what was happening 15-30 minutes ago, yesterday, last week, etc.

Keep the Streams Flowing

As stated in the first principle above, we want members to be able to continue instantly watching movies and TV shows streaming from Netflix when server failures occur, even if the experience is slightly degraded and less personalized. To accomplish this we’ve restructured the API to enable graceful fallback mechanisms to kick in when a service dependency fails. We decorate calls to service dependencies with code that tracks the result of each call. When we detect that a service is failing too often we stop calling it and serve fallback responses while giving the failing service time to recover. We then periodically let some calls to the service go through and if they succeed then we open traffic for all calls.

If this pattern sounds familiar to you, you're probably thinking of the CircuitBreaker pattern from Michael Nygard’s book "Release It! Design and Deploy Production-Ready Software", which influenced the implementation of our service dependency decorator code. Our implementation goes a little further than the basic CircuitBreaker pattern in that fallbacks can be triggered in a few ways:

  1. A request to the remote service times out
  2. The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
  3. The client library used to interact with a service dependency throws an exception

These buckets of failures factor into a service's overall error rate and when the error rate exceeds a defined threshold then...