Lessons in resilience at SoundCloud

Building and operating services distributed across a network is hard. Failures are inevitable. The way forward is to make resiliency a key part of design decisions.

This post talks about two key aspects of resiliency when doing RPC at scale: the circuit breaker pattern, and its power when combined with client-side load balancing.

Circuit breakers v1.0

A basic definition of the Circuit Breaker pattern and its applicability are explained in Martin Fowler’s post on the topic.

With SoundCloud’s original implementation, a client’s view of a remote service transitioned between three pre-defined states, sketched in code after the list:

  • Closed -> Everything seems fine, let all requests go through. After n consecutive failures, transition into Open.
  • Open -> The request, if let through, has a high probability of failing, given recent data. Fail the request pre-emptively and allow the remote service to recover. The circuit remains open for a pre-defined duration, and all requests are failed during that time. Thereafter an attempt to reset is made by transitioning into Half-Open mode.
  • Half-Open -> Something bad happened a while back, let’s be cautious and let a single request through. If it passes, transition to Closed, else to Open.
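
The sketch below captures that state machine in Scala. It is illustrative only, not SoundCloud’s actual implementation; the names (CircuitBreaker, maxFailures, openDuration) are assumptions made for brevity, and the class is not tuned for concurrent use.

```scala
import java.time.{Duration, Instant}

// Illustrative three-state circuit breaker; names and thresholds are
// assumptions, not SoundCloud's production code.
class CircuitBreaker(maxFailures: Int, openDuration: Duration) {
  private sealed trait State
  private case object Closed extends State
  private case class Open(since: Instant) extends State
  private case object HalfOpen extends State

  private var state: State = Closed
  private var consecutiveFailures = 0

  def call[T](request: () => T): T = state match {
    case Open(since) if Instant.now.isBefore(since.plus(openDuration)) =>
      // Open: fail fast and give the remote service time to recover.
      throw new RuntimeException("circuit open, failing fast")
    case Open(_) =>
      // Open period elapsed: let a single probe request through.
      state = HalfOpen
      attempt(request)
    case _ =>
      attempt(request)
  }

  private def attempt[T](request: () => T): T =
    try {
      val result = request()
      // A success resets the breaker to Closed.
      consecutiveFailures = 0
      state = Closed
      result
    } catch {
      case e: Throwable =>
        consecutiveFailures += 1
        // n consecutive failures in Closed, or any failure in Half-Open, opens the circuit.
        if (state == HalfOpen || consecutiveFailures >= maxFailures)
          state = Open(Instant.now)
        throw e
    }
}
```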

With this implementation, a single failure becomes the deciding factor for opening a circuit, blocking all requests to the remote service for a period. This meant a degraded end-user experience and, at times, manually restarting client systems to reset the state.

The missing parts

As you start operating systems under high load, the following questions arise:

  • Can we make more accurate predictions about the health of a remote service by using a moving window and tracking success as a percentage of total requests, as opposed to opening on n consecutive failures with unbounded duration?
  • What happens when one or a few instances (limited to a specific network zone in a data-center, maybe) of a service are unhealthy, but a majority of other instances are healthy and available? Is it fair to fail all requests to a service in such a situation? This is especially relevant in a microservices architecture, where scaling horizontally is common.
  • How can remote services convey more information about their health? An HTTP-based service may be up and running but repeatedly responding with 5xx. Can we make clients smarter by using application-specific information to decide when it’s a good idea to avoid overloading services?

A better alternative

In a series of previous posts, we’ve described how Finagle plays a key role in building our services.

Given that most services at SoundCloud use JSON over HTTP for inter-service communication, resiliency at this layer is critical to our uptime and to dealing with all possible classes of failures.

Earlier this year, we started exploring solutions aimed at improving resilience for our HTTP clients.

We found that most of the gaps detailed above could be addressed through a combination of modules included in the Finagle stack for building clients.

This was good news, but we still needed custom tweaks. The most important of these was identifying and composing the available modules to define policies for how clients behave in failure scenarios, so that we meet our expectations for high availability and a great user experience.

Specifically, we could map each question to a module in Finagle that could potentially address the gaps:

Better than n consecutive failures

The problem: An unbounded duration, instead of a moving window, means clients infer inaccurate information about the health of remote services.

The failure accrual module in Finagle allows clients to define a policy for success in terms of:

  • a moving window specified as a duration or number of requests
  • a success rate, expressed as a percentage of requests that should return an acceptable response
  • a duration for which to mark a specific instance of a service as unavailable

This captures all dimensions by which we want to define success thresholds.
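
As a rough sketch of what this looks like on a Finagle HTTP client, the failure accrual policy below marks an instance dead when its success rate drops below a threshold over a moving time window. The destination addresses, the "users" label, and the thresholds are placeholders, and package paths and parameter lists have shifted between Finagle releases, so treat this as an approximation rather than our production configuration.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http
import com.twitter.finagle.liveness.{FailureAccrualFactory, FailureAccrualPolicy}
import com.twitter.finagle.service.Backoff

// Mark an instance dead when fewer than 95% of requests succeed within a
// 30-second moving window, and keep it out of rotation for 10 seconds.
// All values here are placeholders, not tuned production settings.
val usersClient = Http.client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.successRateWithinDuration(
      requiredSuccessRate = 0.95,
      window = 30.seconds,
      markDeadFor = Backoff.const(10.seconds))))
  .newService("instance-a:8080,instance-b:8080", "users")
```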

Client’s view of individual service instances

The problem: A client’s aggregated view of a remote service means a single rogue instance can trick you into believing that the whole service is unhealthy.

Finagle monitors the health of services per instance and applies the client-defined success policy to decide whether to allow requests through to a given instance.

A few unhealthy nodes can no longer render your service completely unavailable. Big win!

In conjunction with client side load balancing, this creates a dynamic view of a service for clients.

Clients can now continue fetching data, with requests smartly routed only to service instances marked healthy, while unhealthy instances are temporarily ignored and allowed to recover.
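
A minimal sketch of that combination, assuming the same hypothetical "users" service and placeholder addresses: the client resolves the destination to a set of instances and balances load across them, while instances marked dead by failure accrual are skipped until they recover. The "power of two choices" balancer shown here is just one of the strategies Finagle ships with.

```scala
import com.twitter.finagle.Http
import com.twitter.finagle.loadbalancer.Balancers

// Client-side load balancing across the resolved set of instances.
// Instances marked unavailable by failure accrual are taken out of
// rotation and re-admitted once they are healthy again.
val users = Http.client
  .withLoadBalancer(Balancers.p2c())
  .newService("instance-a:8080,instance-b:8080,instance-c:8080", "users")
```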

Using the Application layer for decision making

The problem: Network-layer information alone is often not good enough to make informed decisions about the health of remote services.

The ability for clients to define what counts as a failure, using information available in application-layer protocols, is powerful. This gives fine-grained control and improves on relying solely on network-layer signals to classify remote invocations as successes or failures.

Clients should be able to classify failures from services based on protocol-specific data (e.g. an HTTP 5xx response code).

This provides much needed application specific context to circuit breakers and results in accurate decisions regarding the health of individual instances of a service.

Support for response classification was introduced in Finagle earlier this year, just in time for us to put it to good use.
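
A minimal sketch, using the classifier Finagle bundles for HTTP: treating 5xx responses as failures feeds application-level signal into failure accrual. The destination and label are placeholders, and custom classifiers can be written for other protocol-specific rules.

```scala
import com.twitter.finagle.Http
import com.twitter.finagle.http.service.HttpResponseClassifier

// Count HTTP 5xx responses as failures for failure accrual, rather than
// relying on transport-level errors alone.
val classifiedClient = Http.client
  .withResponseClassifier(HttpResponseClassifier.ServerErrorsAsFailures)
  .newService("instance-a:8080", "users")
```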

Wrapping up

It is important to acknowledge that even composing the appropriate modules and configuring thresholds takes a fair bit of learning before you get it right.

We switched to a moving window defined by duration rather than by number of requests, as we realised it was hard to define request-count thresholds that worked equally well for high- and low-traffic services.

Similar implementations exist for services that talk to database systems, cache nodes, and so on.

Given that we were already on the Finagle stack, the capabilities listed were compelling. A combination of Ribbon and Hystrix (Java-based libraries from Netflix, aimed at fault tolerance) makes for a decent alternative.

The above points aim to provide a high-level summary of the capabilities to look for when choosing to implement or use a framework aimed at improving resiliency.

Related concerns

  • What should be the policy around clients retrying requests?
  • When dealing with a sudden surge in incoming traffic, how can we cordon off services at the edge layer and keep critical core systems available?

We hope to share our experiences on these aspects as we move forward and scale out further.