Real-Time Counts with Stitch

Here at SoundCloud, in order to provide counts and a time series of counts in real time, we created something called Stitch.

Stitch was initially developed to provide timelines and counts for our stats pages, which are where users can see which of their tracks are played and when.

SoundCloud Stats Screenshot

Stitch is a wrapper around a Cassandra database. It has a web application that provides read access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it’s possible to use either one or both of them:

Real Time
For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.
Batch
The batch part is a MapReduce job running on [Hadoop] that reads event logs and calculates the overall totals, and then bulk loads this into Cassandra.

The Problem

The difficulty with real-time counts is that incrementing is a non-idempotent operation, which means that if you apply the same increment twice, you get a different value than if you would only apply it once. That said, if an incident affects our data pipeline and the counts are wrong, we can’t fix it by simply re-feeding the day’s events through the processors; if we did, we would risk double counting.

Our First Solution

Initially, Stitch only supported real-time updates and addressed this problem with a MapReduce job, The Restorator, which performed the following actions:

  1. Calculated the expected totals.
  2. Queried Cassandra to get the values it had for each counter.
  3. Calculated the increments needed to apply to fix the counters.
  4. Applied the increments.

Meanwhile, to stop the sand shifting under its feet, The Restorator needed to coordinate a locking system between itself and the real-time processors. This was so that the processors didn’t try to simultaneously apply increments to the same counter, which would result in a race condition. To deal with this, The Restorator used ZooKeeper.

As you can probably tell, this setup was quite complex, and it often took a long time to run. But despite this, it worked.

Our Second Solution

Luckily, a new use case emerged: a team wanted to run Stitch purely in batch. This is when we added the batch layer, and we used this as an opportunity to revisit the way Stitch was dealing with the non-idempotent increments problem. We evolved to a Lambda Architecture-style approach, where we combined a fast real-time layer for a possibly inaccurate but immediate count with a batch slow layer for an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters, and it is up to the reading web application to return the correct version when queried. At its most naive, it returns the batch counts instead of the real-time counts, whenever they exist.

Conclusion

To find out how Stitch has evolved over the years, you can read this updated post, Keeping Counts In Sync.

Stitch Diagram

Thanks go to Kim Altintop and Omid Aladini, who created Stitch, and John Glover, who continues to work on it with me.

If this sounds like the sort of thing you’d like to work on too, check out our jobs page.