Quantization of Books 3: How Many Books Is That?

When I saw the data generated by the sales rank tracker Matthew Beckler was kind enough to put together, I joked that I hoped to someday need a logarithmic scale to display the sales rank history of How to Teach Physics to Your Dog. Thanks to links from Boing Boing, John Scalzi, and Kevin Drum, I got my wish:

For those not familiar with the concept, a log scale plots values on a scale that represents each order of magnitude as a fixed distance. So, the top horizontal line on that plot represents a sales rank of a million, the line below that a hundred thousand, the line below that ten thousand, and so on. This tends to blow up the detail at smaller ranges, allowing you to see more of the variation. On a linear scale, everything after the big downward spike at about 260 hours is just flat. Zooming in just a little, it still looks like this:
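For anyone who wants to see the arithmetic behind that, here is a minimal Python sketch. The function name `log_position` is mine, not anything from the tracker, and the numbers are just the gridline values and the 396-to-2500 range mentioned in this post.

```python
import math

def log_position(rank):
    """Vertical position of a sales rank on a base-10 log axis."""
    return math.log10(rank)

# Gridlines at each order of magnitude end up evenly spaced.
gridlines = [1_000_000, 100_000, 10_000, 1_000]
positions = [log_position(r) for r in gridlines]
spacings = [a - b for a, b in zip(positions, positions[1:])]

# Share of the axis taken up by the range 396..2500, assuming the axis
# runs from a rank of 1 up to a rank of 1,000,000.
linear_share = (2500 - 396) / (1_000_000 - 1)
log_share = (log_position(2500) - log_position(396)) / (
    log_position(1_000_000) - log_position(1)
)

print(spacings)
print(linear_share, log_share)
```

The same range of ranks takes up a far larger share of the log axis than of the linear one, which is why the detail becomes visible.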

There’s still a good deal of variation in the flat bit of that graph, from a minimum value of 396 to a maximum of just over 2500 (as of 8pm Eastern Sunday night), but it’s hard to see just what’s going on without losing the higher points of the data.

This is all very nice, but of course, the whole point of having this data is to try to extract information that you wouldn’t be able to get otherwise. So, can we figure out from this plot how many books were sold in this interval?

If you recall my previous excursion into number-crunching of these data, you’ll remember that I made a plot of the (downward) change in sales rank as a function of the starting sales rank. This turned out to be remarkably linear, corresponding to a model where a single sale at a lower rank produces less of a change in that rank than a single sale at a higher rank. In other words, if you start at a ranking of 100,000, selling one book leaps you past a large number of other books, while if you start at a ranking of 1,000, a single sale doesn’t make as much difference.

Doing the same thing with the larger dataset yields the following plot (I’ve deleted a few oddball points where the rank changed by only a few places in the 70,000 range):

The blue points are data from before the publication and the big sales boost from Boing Boing/Whatever, the red are points from after that. You can see that they clearly don’t all fall on the same line. The two solid lines represent straight lines fit to the two data sets, and you can easily figure out which equation goes with which. It should be noted that while on this scale, the red points sort of look like they fit a line, if you zoom in, they really don’t:

I suppose you could fit a line to that, if you were an economist or an astronomer, but I’m not going to waste anybody’s time with that.

So, using this linear model, what does the big downward spike correspond to? Well, the fit parameters from the plot above suggest that the large jump Tuesday morning was 5.4 times bigger than the model would predict for a single sale, which would correspond to the sale of 5-6 books.

That’s nice, and all, but the problem is that the next spike down, according to the model, represents the sale of -1.4 books. That’s because the fit above has a non-zero intercept, meaning that it predicts a ranking change of zero for a single sale at a rank of around 14,000, and below that level, the ranking change is negative. That’s clearly wrong– if 1.4 people returned their copies, my sales ranking would not get better.
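To make the sign problem concrete, here is a rough Python sketch. The slope and intercept are placeholders I picked only so that the line crosses zero at a rank of 14,000, as described above; they are not the actual fit parameters from the plot, and the function names are hypothetical.

```python
# Hypothetical fit parameters, NOT the real ones from the plot. They are
# chosen only so the predicted change hits zero at a rank of 14,000.
SLOPE = 0.5
INTERCEPT = -7000.0

def predicted_drop(rank):
    """Rank improvement the linear model predicts for a single sale."""
    return SLOPE * rank + INTERCEPT

def zero_crossing():
    """Rank at which the model predicts no change for one sale."""
    return -INTERCEPT / SLOPE

def books_for_jump(start_rank, end_rank):
    """Observed improvement divided by the one-sale prediction."""
    return (start_rank - end_rank) / predicted_drop(start_rank)

print(zero_crossing())
print(books_for_jump(100_000, 57_000))  # start above the crossing
print(books_for_jump(1106, 683))        # start below the crossing
```

Any jump that starts below the zero crossing gets divided by a negative prediction, so the book count comes out negative no matter how big the improvement was.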

So, how could we improve this? Well, logic dictates that a sales rank of 1 can’t get any higher, so we could impose a model where the ranking change is 0 for a sales rank of 1. If we do that, the pre-publication data look like this:

I’ve done two different fits to this, one a linear fit constrained to go through the origin, the other a power-law fit, just to have something with a bit of upward curve to it. Using the simple linear model, the big downward jump corresponds to about 2 books, and using the power-law fit, it’s 4 books. Interestingly, the power-law fit gives higher values for some of the later downward jumps, with a peak of 13 for the jump from 1106 to 683 a few hours after the initial spike.
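For the curious, here is one way those two kinds of fit could be done in plain Python. This is a sketch under assumptions, not necessarily how the fits in the plot were made: the power law is fit by ordinary least squares on the logarithms (so every change has to be positive), and the data points are made up for illustration.

```python
import math

def fit_through_origin(xs, ys):
    """Least-squares slope m for y = m * x, intercept forced to zero."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    k = sxy / sxx
    a = math.exp(mean_y - k * mean_x)
    return a, k

# Made-up (starting rank, downward change) pairs, for illustration only.
start_ranks = [20_000, 50_000, 100_000, 200_000]
drops = [1_500, 6_000, 17_000, 48_000]

slope = fit_through_origin(start_ranks, drops)
a, k = fit_power_law(start_ranks, drops)
print(slope, a, k)
```

An exponent `k` greater than 1 is what gives the power law its upward curve. Both models predict essentially no change at a rank of 1 compared to the changes seen at higher ranks.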

So, how many books does all this represent? Well, summing up all the changes from the power-law model gives 154 books. The same summing for the simple linear fit with zero intercept gives just 11– the vast majority of the points after the spike correspond to less than one book’s worth of that model’s prediction. A third fit, using a second-order polynomial (which had a slightly better R² than either of the others) predicts around 30 books.
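The summing itself is simple enough to sketch in Python. Here `estimate_books` and the model passed to it are hypothetical names, and the rank series is invented: each hour-to-hour improvement is divided by what the model predicts for a single sale at the starting rank, and the results are added up.

```python
def estimate_books(ranks, predicted_drop):
    """Sum observed improvement / one-sale prediction over all hours
    in which the rank improved (that is, got numerically smaller)."""
    total = 0.0
    for start, end in zip(ranks, ranks[1:]):
        drop = start - end
        if drop > 0:
            total += drop / predicted_drop(start)
    return total

# Invented hourly ranks and an invented one-sale model, for illustration.
hourly_ranks = [1000, 800, 900, 450]

def one_sale_model(rank):
    return 0.1 * rank

print(estimate_books(hourly_ranks, one_sale_model))
```

Swapping in a different model function changes the total, which is how three fits to the same data can give three very different numbers of books.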

None of these models are particularly good, though– for one thing, the fits aren’t great. And there’s no particular justification for the use of a power law or a parabola– they’re just easy functions to work with, mathematically.

In the end, the best I can say is that, over the whole data period, there are just about 100 points where the sales rank improved from one hour to the next. If you take the incredibly naive picture that each of those improvements represents at least one sale, that gives a lower bound of about 100 books sold. That’s more or less consistent with other people’s analyses of what sales rank means in terms of sales.
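That naive count needs no model at all. A minimal Python sketch, with a hypothetical function name and an invented rank series:

```python
def count_improvements(ranks):
    """Number of hour-to-hour steps where the rank got numerically smaller."""
    return sum(1 for prev, cur in zip(ranks, ranks[1:]) if cur < prev)

# Invented hourly ranks, for illustration only.
sample_ranks = [5000, 4200, 4300, 4300, 3900, 1200]
print(count_improvements(sample_ranks))
```

Hours where the rank stayed the same or got worse are simply skipped, so the count can only underestimate sales if each improvement really does mean at least one sale.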

Which of these figures is right? I have no idea, and no way to determine the answer. I won’t get any kind of real sales numbers for at least six months, maybe a year (unless somebody at Scribner is feeling generous, and wants to send me numbers). What I eventually get won’t be nearly fine-grained enough to determine the number of sales via Amazon in the first week after publication, either.

But, hey, playing with numbers is fun…

One thought on “Quantization of Books 3: How Many Books Is That?”

Moopheus says:

December 28, 2009 at 2:12 pm

Your guess is probably a little low–I’d be willing to bet more like 2-400. But my personal experience with Amazon sales ranks is a little stale–several years old. But when I worked “in-house” as it were for New York publishers, I could actually see how much Amazon was ordering for any given title (and this was before Bookscan, so it wasn’t always clear what actual point-of-sales were). But since the number represents a relative rate of sales, it’s difficult to translate it to absolute numbers.
