Zipf it

July 05, 2011

At some point I got curious about whether my iTunes listening habits obey Zipf’s law. Let’s find out.

My library has 1625 songs, of which I’ve played 1186 (73%) at least once. On average, each song has been played 4.890 times, but that’s a crude measure because I play some songs more frequently than others. Let’s take a look at the full distribution:

This is a histogram. It shows the number of songs in my library that have been played once, twice, three times, etc. As you can see, the distribution has a long tail: most songs are listened to only a few times (e.g., 1299 of my songs have at most 5 plays), but some are listened to again and again (e.g., my most-listened-to song has 128 plays, way out there on the right-hand side of the plot). My top 10 songs together make up 63% of my total listening time.
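
For anyone who wants to run the same summary on their own library, here is a minimal sketch in Python. The play counts are made up for illustration (they are not my library), the name `summarize` is hypothetical, and the mean is taken over all songs, including unplayed ones, which is an assumption.

```python
from collections import Counter

# Hypothetical play counts, one entry per song (illustrative data only).
play_counts = [0, 0, 0, 1, 1, 2, 2, 3, 5, 8, 13, 40]


def summarize(plays):
    """Return (fraction of songs played, mean plays per song, histogram)."""
    played = sum(1 for p in plays if p > 0)
    fraction_played = played / len(plays)
    # Assumption: the mean is over every song, including unplayed ones.
    mean_plays = sum(plays) / len(plays)
    # Maps a play count to the number of songs with exactly that many plays.
    histogram = Counter(plays)
    return fraction_played, mean_plays, histogram


fraction_played, mean_plays, histogram = summarize(play_counts)
print(f"played at least once: {fraction_played:.0%}")
print(f"mean plays per song: {mean_plays:.3f}")
for plays, songs in sorted(histogram.items()):
    print(f"{plays:>3} plays: {songs} song(s)")
```

The `histogram` mapping is exactly what a bar chart of this kind draws: play count on the horizontal axis, number of songs on the vertical axis.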

There’s a better way to visualize data like this: a log-log plot of frequency versus rank.

Here’s how to read the graph. The horizontal axis gives the rank, so my most-listened-to song (rank 1) is the red point at the far left. The vertical axis gives the frequency, expressed as a percentage of total listening time.
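
The points on a rank-frequency plot can be computed along these lines. This is a sketch under assumptions: the data is invented, `rank_frequency` is a hypothetical helper, and play counts stand in for listening time (song durations are ignored). Unplayed songs are dropped because a frequency of zero has no logarithm.

```python
def rank_frequency(plays):
    """Return (rank, percent_of_total) pairs, most-played song first."""
    # Drop unplayed songs: log(0) is undefined, so they cannot be plotted.
    played = sorted((p for p in plays if p > 0), reverse=True)
    total = sum(played)
    return [(rank, 100.0 * p / total) for rank, p in enumerate(played, start=1)]


# Hypothetical play counts, one entry per song (illustrative data only).
play_counts = [0, 40, 20, 10, 5, 4, 1, 0]
points = rank_frequency(play_counts)
for rank, percent in points:
    print(f"rank {rank}: {percent:.2f}% of plays")
```

Plotting these pairs with both axes on a logarithmic scale gives a graph of the kind described above.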

Q: How do the two relate?

Zipf’s law proposes an answer to this question. It holds that the frequency of an item (e.g., a word in a body of naturally occurring text) is inversely proportional to its rank. On log-log axes, an inverse proportionality shows up as a straight line, so Zipf’s law predicts that the above data would be well fit by one. Let’s find out by superimposing over the data the best-fit linear function, drawn in grey:
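
To see why a straight line is the right thing to look for: if frequency f is proportional to rank r raised to some power -s, then taking logarithms gives log f = log C - s log r, which is a line with slope -s. One common way to get a best-fit line is ordinary least squares on the logged values. The sketch below does that with made-up data that follows Zipf’s law exactly; the function name `fit_loglog` is hypothetical, and least squares is an assumption about the fitting method, not a record of how the grey line was produced.

```python
import math


def fit_loglog(ranks, freqs):
    """Least-squares line through (log10 rank, log10 frequency).

    Returns (slope, intercept) of the fitted line.
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data that obeys Zipf's law exactly: frequency = 100 / rank.
ranks = list(range(1, 51))
freqs = [100.0 / r for r in ranks]
slope, intercept = fit_loglog(ranks, freqs)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

On this synthetic data the fitted slope comes out at -1 up to floating-point rounding. Real data will generally give a different slope, and because a plain least-squares fit weights every point equally, the many low-ranked songs in the tail can dominate the result.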

So what do you think? Does Zipf’s law seem to hold for my iTunes listening? Try to find parts of the curve that fit the data particularly well, or not so well. For many datasets, the slope of the above line is approximately -1; for my data, it’s -0.660.

For further reading:

Seeing around corners by Jonathan Rauch in The Atlantic.
Benford’s law, Zipf’s law, and the Pareto distribution by Terence Tao on his blog.
The Wikipedia article on Zipf’s law.