A journey through the SoundCloud network
Though SoundCloud’s journey has been anything but stable, one key advantage they have over other music streaming services is a prime combination of content distribution and social networks. As a novice producer myself, I asked the question on everyone’s mind: how do songs go viral? Using network analysis, I tried to figure it out.
My original hypothesis was based on the idea of “Mavens”, coined in Malcolm Gladwell’s Tipping Point.
“Mavens are […] information specialists who we rely on to connect us to new information.”
In this case, a Maven would be the one friend who always shares the music they’ve listened to recently, or the one whose playlists you have on repeat. These are people who seek out new music and, importantly, actively disseminate it to their immediate network.
Hypothesis: There are certain key people (Mavens) who need to share a song to make it go viral.
In this model, Mavens would serve as “bridges” between different, clustered communities of SoundCloud users. When the Mavens hear and share the song, they expose it to vastly different groups of users who are unlikely to hear the song from another source, thus causing a viral spread in that community. This phenomenon is widely present in other social networks, so this hypothesis follows as an extrapolation to a network built around music.
In order to test my hypothesis, I had to find a reliable way to identify Mavens in a network. The networks in question were the immediate followers of active artists. I determined that Mavens could be identified by looking for perturbations in what I called the “Reposting Timeline” of songs posted by the artist. On SoundCloud, users can “repost” songs to their timeline, which then shows up on their followers’ feeds. Each repost is registered with the time it was posted, so the reposting timeline is a history of user reposts for a single song.
Clearly, reposting is how active users share new music that they find with their direct followers. Thus, if a user is a Maven, we expect a spike in the timeline after they repost due to the exposure of the song to a new audience. In this way, we can visually point out users that are likely to be Mavens.
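This spike-hunting step can be sketched in a few lines. The function below is a hypothetical illustration (not the project’s actual code): it buckets repost timestamps by day and flags any day whose count sits well above the rest.

```python
from collections import Counter
from datetime import date, datetime
from statistics import mean, stdev

def find_spike_days(timestamps, threshold=2.0):
    """Bucket repost timestamps by day and flag any day whose repost
    count exceeds the mean daily count by `threshold` standard deviations."""
    days = Counter(ts.date() for ts in timestamps)
    counts = list(days.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [day for day, n in days.items() if n > mu + threshold * sigma]

# One repost a day for ten days, plus a burst of ten extra on Jan 5:
timeline = [datetime(2021, 1, d) for d in range(1, 11)]
timeline += [datetime(2021, 1, 5, hour) for hour in range(1, 11)]
print(find_spike_days(timeline))  # [datetime.date(2021, 1, 5)]
```

A real pipeline would want smarter baselines (release-day spikes, weekly cycles), but this is the shape of the idea: a Maven candidate is whoever reposted just before a flagged day.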
Once I had both the networks and the reposting timelines for various artists and songs, I would be able to analyze the Mavens’ networks and compare a host of basic network characteristics across Mavens of different artists, genres, and sizes.
Here I’ll dive into some of the code techniques and how I was able to speed up such a large scale web scraping project. If you’re not interested in the details, feel free to skip right to the results section below.
All the web scraping I did was guided by three main objectives:
- Generating and saving reposting timelines
- Generating and saving follower networks of given artists
- Generating basic network characteristics for analysis
My first stab at the above objectives made use of the common but powerful web scraping library, Selenium. This tool gives easy methods for searching for specific elements in a page but, as you’ll read later, does not have the best solution for infinite scrolling pages.
The order of operations for getting the reposting timeline of each track looked something like this:
1. Go to www.soundcloud.com/<artist-uri>/<track-uri>
2. Get the full list of uris of users who reposted the song
3. For each user in the reposter list:
   - Scroll through their timeline
   - Find their repost of the song in question
   - Save the timestamp along with the user uri
Immediately, we can start to optimize each step in this process. For Step 1, instead of manually getting the uri for each song, we can simply get the uri for each artist and scrape for the uri of every song they have posted under the link www.soundcloud.com/<artist-uri>/tracks. This will give us a list of songs to start generating reposting timelines for.
Steps 2 and 3 are a little trickier. In order to get a list of all users who reposted, we have to go to www.soundcloud.com/<artist-uri>/<track-uri>/reposts, which lists all reposters and their uris in an infinite scrolling format. As you scroll to the bottom, more users are loaded in batches of 12 until the entire list is exhausted. This means the scraping code has to perform this scrolling action programmatically for every song in order to get the full list. Once the list of reposters is found, finding the exact timestamp associated with each repost requires scrolling through each user’s timeline until the song is found, which is also served to the client as an infinite scrolling list.
This interaction becomes a problem if the number of reposters is high or if a particular user reposts often. In either case, dozens of 12-entry loads have to complete before all reposters or the right song is found. Though Selenium has some nifty built-in methods for scrolling through a page, the program still has to wait for the page to load every 12-user chunk. At this level of efficiency, reposting timelines of 300–400 reposts required 6+ hours of scraping time.
Ideally, for reposting timelines, I would enter a list of artist names, run the program for an hour or so, and have reposting timelines for all songs by those artists saved in a readable format. For follower networks, I would enter artist names and how many connections away from the original artist I wanted to search, run the program for a similar amount of time, and have a full list of edges between the set of users within my input number of connections from the artist node. The key bottleneck for both objectives became the number of loads necessary due to infinite scrolling. The better approach? Reverse-engineering the internal API.
Reverse-Engineering the Internal API
Obviously, all this data could be easily gathered if SoundCloud had a publicly available API. They do, but they have stopped issuing keys to new users, with no plan to re-open applications anytime soon. So we resort to web scraping, which essentially emulates a user browsing the web for information, but programmatically and thus faster.
What if we could figure out how to access SoundCloud’s internal API? The use of this internal API is not obscured in any way. Simply going to the network tab in the developer console will reveal the calls:
Zeroing in on one of these calls to the api-v2 endpoint, we can start dissecting the query parameters and make some educated guesses as to what they represent:
- 60121854 → userId for tk03
- client_id → string used to represent the current client (i.e., your laptop)
- limit → number of users to load
- offset → index of the first user to load (assuming the list of followers is stored as an array)
- app_version → when you logged into the app/when your session started (can be converted to a date)
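The app_version guess is easy to check. Assuming the value is a unix timestamp (an assumption based on its format, not anything SoundCloud documents), the conversion is a one-liner:

```python
from datetime import datetime, timezone

def app_version_to_date(app_version):
    # Assumes app_version is a unix timestamp; the value below is a
    # made-up example, not a real session's parameter.
    return datetime.fromtimestamp(int(app_version), tz=timezone.utc)

print(app_version_to_date("1609459200"))  # 2021-01-01 00:00:00+00:00
```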
Once we make these guesses, we can start trying different values for the parameters. Pasting the new url into a separate tab will reveal a fully parsed JSON object with the data we need. If I increase the limit to 20, 20 users are loaded. If I swap the userId in the path for another user’s id, that user’s followers are loaded.
With enough tweaking, a few restrictions reveal themselves as well:
- The maximum number of users you can load at once (limit) is 200
- offset refers to the timestamp by which you are searching. The call will search for up to <limit> number of posts in reverse chronological order, starting at <offset> time.
What’s more, the response object comes with a field called next_href that gives a url that will load the next batch of responses with the right offset. This offset corresponds to the time at which the last user in the previous set started following the artist. The same loading mechanism is used across the site, including loading reposters and likes.
Using this information, we can reformat the reposter list code to avoid the infinite scrolling problem completely. The flow now goes something like this:
- Go to api-v2.soundcloud.com/tracks/<track-id>/reposters?client_id=<client-id>&app_version=<app-version>&app_locale=en&offset=0&limit=1
- While next_href is not null:
  - From next_href, get the time of the latest repost
  - Save the latest user and timestamp
  - Repeat the process, starting at next_href
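The loop above can be sketched as a function that walks next_href until it runs out. Here `fetch` is a stand-in for any HTTP GET that returns parsed JSON (e.g. `requests.get(url).json()`); the payload field names `collection`, `user`, and `created_at` are assumptions about the response shape, while `next_href` comes straight from the API:

```python
def collect_repost_timeline(first_url, fetch):
    """Walk a paginated endpoint via next_href, collecting
    (user, timestamp) pairs. `fetch` is any callable that maps a url
    to the parsed JSON response for that page."""
    timeline = []
    url = first_url
    while url is not None:
        page = fetch(url)
        for item in page.get("collection", []):
            timeline.append((item["user"]["permalink"], item["created_at"]))
        url = page.get("next_href")
    return timeline

# Demo with a stubbed fetch standing in for real API responses:
pages = {
    "page1": {"collection": [{"user": {"permalink": "alice"},
                              "created_at": "2021-01-02T00:00:00Z"}],
              "next_href": "page2"},
    "page2": {"collection": [{"user": {"permalink": "bob"},
                              "created_at": "2021-01-01T00:00:00Z"}],
              "next_href": None},
}
print(collect_repost_timeline("page1", pages.get))
```

Injecting `fetch` keeps the pagination logic testable without touching the network; in production it would wrap a real HTTP client plus the session parameters described below.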
With one parse through the entire list of reposters, we get both the timestamp of each repost and the user associated with it. A similar approach can be applied to get a follower timeline for an artist or a timeline of “likes” for a song. Getting just a list of followers is even simpler. Since we don’t need the timestamp data, we can max out the number of users loaded per batch (200) and collect a list of users quickly.
Before we can call this approach complete, we need a way of programmatically gathering the track-id, client_id, and app_version to add to our requests. The manual way we have been doing it is by going through the network call stack in the developer console until we find a call that has these key pieces of information. By changing the permissions of the Headless Chrome driver, which we use to load pages for scraping, we can capture that network call stack locally. Once we have the list of calls, we notice that the specific call /reposters occurs with every request for reposters and has all three pieces of session information we are looking for. A simple filter for this call will give us the url. Parsing the url will give us the parameters, which we can then store for further use in the session.
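Parsing that captured url is straightforward with the standard library. The url below is a made-up example in the shape of the calls seen in the network tab:

```python
from urllib.parse import urlparse, parse_qs

def extract_session_params(reposters_url):
    """Pull the track id, client_id, and app_version out of a captured
    /tracks/<track-id>/reposters call."""
    parsed = urlparse(reposters_url)
    query = parse_qs(parsed.query)
    # path looks like /tracks/<track-id>/reposters
    return {
        "track_id": parsed.path.split("/")[2],
        "client_id": query["client_id"][0],
        "app_version": query["app_version"][0],
    }

# Hypothetical captured call (ids are invented for illustration):
url = ("https://api-v2.soundcloud.com/tracks/123456/reposters"
       "?client_id=abc123&app_version=1609459200&limit=1&offset=0")
print(extract_session_params(url))
```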
With this workaround, we cut the scraping time to 1/10th of what it was using Selenium, all because we were able to successfully dissect the internal API. The drastic improvement in speed can be explained by two key points:
- we don’t have to load all images and html elements for every batch
- we don’t have to wait for infinite scrolling
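Some rough arithmetic makes the second point concrete. For a hypothetical artist with 1,000 reposters, 12-at-a-time page loads versus 200-record API calls works out to:

```python
from math import ceil

reposters = 1000  # hypothetical artist, for illustration only

scroll_loads = ceil(reposters / 12)   # browser loads reposters 12 at a time
api_calls = ceil(reposters / 200)     # the API serves up to 200 per request

print(scroll_loads, api_calls)  # 84 5
```

And each of those 84 scroll loads also renders images and HTML while each of the 5 API calls returns only JSON, which compounds the gap.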
With Selenium, our program scrapes the same way we would manually read a page, just faster. With the API, we are able to scrape as the server would, loading only the data essential for collection.
Fast, Containerized Network Calculations
Generating and comparing key network characteristics is one of the main goals of this project and many other network-related projects, but can prove to be extremely time intensive. Here’s an outline of some common network characteristics and how they are calculated:
- Clustering coefficient: the proportion of a given node’s neighbors that have an edge between each other. If a node has 4 neighbors, there are a total of 6 (undirected) edges that can exist between them; the node has a clustering coefficient of 1.0 if all 6 edges exist.
  - This requires calculating the edges between first-degree neighbors of every node.
- Degree assortativity: the tendency of nodes to connect to nodes of similar degree. The lower the assortativity, the greater the difference (on average) between the degrees of connected nodes.
  - This requires comparing the degree of the nodes on either end of every edge.
- Motifs: subgraphs in the network, usually of 3 nodes, with the same arrangement of directed or undirected edges.
  - This involves counting all such subgraphs and categorizing them based on their edge topology.
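To make the clustering coefficient concrete, here is a toy computation on a small adjacency dict (an illustration, not the project’s graph-tool code). It reproduces the 4-neighbor example above:

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of possible edges among `node`'s neighbors that exist.
    `adj` maps each node to the set of its (undirected) neighbors."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    possible = k * (k - 1) / 2
    actual = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return actual / possible

# A node with 4 neighbors that all follow each other: 6 of 6 edges exist.
adj = {
    "x": {"a", "b", "c", "d"},
    "a": {"x", "b", "c", "d"},
    "b": {"x", "a", "c", "d"},
    "c": {"x", "a", "b", "d"},
    "d": {"x", "a", "b", "c"},
}
print(clustering_coefficient(adj, "x"))  # 1.0
```

The cost is visible in the nested work: every node triggers a pass over all pairs of its neighbors, which is why these statistics blow up on large networks.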
Obviously, generating these statistics can become extremely time intensive once the network grows to thousands of nodes with tens of thousands of edges. The traditional Python library for network manipulation and analysis is NetworkX. This library is great for quick local simulations and small test networks, but cannot handle the load of analyzing a large network. A significantly faster library, graph-tool, is implemented in C/C++. They have a nifty comparison of their computation speeds against other popular network analysis libraries:
The trick here was to scrape locally and save the networks to csv, run graph-tool in a Docker instance with access to the local network csvs, and save the results of the graph-tool analysis locally for further Python manipulation. Graph-tool is extremely efficient, but the setup can be non-trivial. Thus, treating it as a black box simply for calculating network statistics worked well. Once these values are calculated, the visualization can be done with other libraries in the local environment.
Results
To address the original hypothesis: we have no real conclusion.
Though some of the reposting timelines have visible spikes, it’s hard to point to one user as the cause.
The same user is not consistently the cause of spikes over multiple songs by the same artist, so spikes in the reposting timeline could just as well have been caused by artist-driven actions like social media pushes, exposure at a concert, or other marketing tactics.
On top of that, even if there are Mavens detectable in this way, their effect could easily be “smoothed out” over time based on other variables such as timezone differences or varying user activity on the site. For example, even if a Maven from Tokyo religiously reposts artist A’s song, if most of their listeners are in New York, the timezone will affect when the effect of the Maven is picked up and which audience it is reaching, essentially minimizing the effect of the Maven.
That being said, there are still some notable findings from comparing the basic network characteristics across different artists.
Here, we see all three Indian-American DJs with some of the highest clustering coefficients. This supports the idea that the music they make is for fairly insular groups who are likely to follow each other, thus driving up the clustering coefficient.
Here, we compare the percent of followers that are also followed by the artist themselves. We see Kevin, one of the known Mavens, with among the highest edge reciprocity. We also see Laszewo, the most popular of the artists analyzed, with the lowest reciprocity. This suggests that artist accounts generally follow fewer people, while Mavens are strongly connected to their followers.
Perhaps one of the most interesting revelations answered a question completely unrelated to the initial one. When analyzing motifs (subgraphs with a specific edge topology) for every artist, I generated visualizations such as this one, also known as Motif Significance Profiles:
The x axis shows the “motif identifier” and the y axis shows the number of such motifs in the network. For directed subgraphs of 3 nodes, there are exactly 13 possible motifs to be constructed:
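That count of 13 can be verified by brute force: enumerate all 2^6 subsets of the 6 possible directed edges on 3 nodes, keep only the weakly connected ones, and deduplicate up to relabeling of the nodes. A small sketch:

```python
from itertools import permutations, product

NODES = (0, 1, 2)
EDGES = [(a, b) for a in NODES for b in NODES if a != b]  # 6 directed edges

def is_weakly_connected(edge_set):
    # Union-find over the undirected versions of the edges.
    parent = {n: n for n in NODES}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in edge_set:
        parent[find(a)] = find(b)
    return len({find(n) for n in NODES}) == 1

def canonical(edge_set):
    # Smallest relabeling of the edge set over all node permutations,
    # so isomorphic graphs collapse to the same key.
    return min(
        tuple(sorted((perm[a], perm[b]) for a, b in edge_set))
        for perm in permutations(NODES)
    )

motifs = set()
for bits in product([0, 1], repeat=len(EDGES)):
    edge_set = [e for e, bit in zip(EDGES, bits) if bit]
    if edge_set and is_weakly_connected(edge_set):
        motifs.add(canonical(edge_set))

print(len(motifs))  # 13
```

This matches the classic triad census: 16 classes of 3-node digraphs overall, 3 of which are disconnected, leaving the 13 connected motifs plotted on the x axis.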
For Mawi, we see mostly motifs of types 6, 7, and 8. This is fairly common for younger, up-and-coming artists, as they tend to follow many others in order to gain a following of their own.
Our known Maven’s motif significance profile looks significantly different. The most common motif is type 9. This checks out nicely with our previous insight about high edge reciprocity, as well as our intuition for what a Maven is. Since he serves as a sort of “bridge” between users who subscribe to his music taste, in motifs such as type 9 he is able to listen to songs posted by one follower and share them with the other followers in that subgraph.
Intuitively, it makes sense for there to be only 13 motif types. If there are a fixed number of nodes in a graph, there is a maximum number of edges in the graph, and thus there will be a fixed number of arrangements for the edges. However, for two artists, the data revealed the following:
This shows over 100 distinct motifs of 3 nodes in the network! On further inspection, one of the motifs with a higher identifier number looks like the one on the right.
In the context of followers, the motif above shows “Node A follows Node B 5 times” which, obviously, makes little sense. As a single user, how can I follow an artist more than once? A dive into the edge data revealed the names of the accounts associated with these multiple follows, one of which is known as “switch-network”.
Accounts like these exist to provide an artificial boost to the number of followers and reposts for a given artist. The perceived popularity of the song or artist is then amplified, pushing them up out of the noise of SoundCloud posts to be recognized and selected for playlists, the main mechanism by which new music is distributed. Unfortunately, bots like these corrupt the built-in metrics for popularity, making the system itself inherently flawed. With these findings, the scraping engine built here can easily be reconfigured as a bot detection tool to list all malicious or autonomous accounts.

Along those lines, here are some of the follow-up questions I thought of while working on this project:
- Knowing the malicious players, could you somehow normalize the network to show the true repost/follow/like count for tracks and artists? Essentially, could you clean the network?
- How do network characteristics such as betweenness and clustering coefficient correlate with the time it takes for a song to reach 50, 100, or 500 reposts?
- How does the genre of the song affect its spread through the network? Are listeners of certain genres more keen on reposting?
- How are brand new artists in a network treated? Do these networks follow similar trends as known generative network models?
- What effect do certain events (removing/adding an artist, merging two genres, live performances, etc.) have on a network?
- Based on user activity, what is the best time to post (or listen) on SoundCloud?
- Is there a better, network-based way to fairly suggest and disseminate new music?
Over the course of a couple weeks, what started out as a research question about viral networks and spread ended with an algorithm to systematically pick out automatic reposters and followers on SoundCloud. The lesson here, if anything, is to dig into the data. You never know what insights you’ll end up with.