SoundCloud for Developers

Discover, connect and build

Backstage Blog January 24th, 2018 Search Data Science Machine Learning Analytics Data PageRank in Spark By Josh Devins

SoundCloud is a platform consisting of hundreds of millions of tracks, people, albums, and playlists. Navigating this vast collection of music and personalities poses a large challenge, particularly when it comes to music with so many covers, remixes and original works all in one place. For search on SoundCloud, one of the ways we approach this problem is by using the PageRank algorithm, with our version affectionately known as DiscoRank (Get it? Disco as in discovery and Saturday Night Fever?!). The job of PageRank is to help rank search results from queries such as, finding all Go+ tracks called "royals". At first glance, this task might seem trivial. The first result should indeed be Lorde's original Royals, but of all the covers and remixes, which should we show at the top and in which order? What about other tracks in our catalog that have the word "royals" in them? Where should they be in our search results list?

Our use of PageRank started way back in the summer of 2012 when we introduced SoundCloud Next (now our current website). As the years have gone by, we have grown exponentially and so our implementation of PageRank has had to scale with that enormous growth. Over those years, we have implemented PageRank in several ways from an in-memory Java implementation, to distributed in Hadoop using the vanilla MapReduce API, and more recently using Scalding. Last summer we made the switch to our latest version of PageRank that runs in Spark and now we’re open-sourcing our implementation as a reusable library!

Before implementing yet another version ourselves, we first tried out the built-in PageRank from Spark. Unfortunately, we quickly realized that it did not support the graph that we were using as we have nodes with no out-edges, some with no in-edges, and edge weights that are not uniform. After this first try, we decided to implement PageRank ourselves and tried a variety of the Spark APIs, including GraphX. Although our GraphX implementation was working in small-scale test cases, when we ran at full-scale, we encountered several bugs in GraphX that are currently unresolved. Based on this experience, we decided that a simple approach using the RDD API would suit us best. We spent about an equal amount of time on implementation as we did on performance evaluation and tuning, but in the end, Spark was a perfect fit for this job. We were able to utilize many of the features that Spark is known for, such as in-memory data storage, easy parallel computation, and implementing iterative learning algorithms. As a result of this work, we have made improvements to the performance, scalability, and general flexibility of our PageRank implementation. Able to perform one iteration of PageRank in approximately 3 to 5 minutes on a graph of over 700 million nodes and 15 billion edges, this new implementation has enabled us to update ranks in search more frequently throughout the day and prepared us for growth over the coming years.

With open-sourcing our implementation of PageRank, we hope that you can benefit from the work we’ve put into this for your own use-cases, from search to recommendations. Go check out the project, try it out, open issues, and get involved!