minstrel
Mar 5th 2007, 9:39 am
Alexa Toolbar and the Problem of Experiment Design (http://norvig.com/logs-alexa.html)
by Peter Norvig
Consider the problem of comparing traffic to internet sites. Most sites keep their traffic numbers secret, so you need to rely on third parties that monitor a sampling of traffic. One such third party is the alexa.com traffic rankings. If you download a toolbar from Alexa, your visits are tracked anonymously, and the aggregate statistics are available for all to see. As Alexa explains, there are some biases inherent in this process: sites associated with Alexa, such as Amazon.com, are overrepresented; sites that use the https protocol are underrepresented; and so on. But one bias they don't really comment on is the selection bias: the data would be good if it truly represented a random sample of internet users, but in fact it only represents those who have installed the Alexa toolbar, and that sample is not random. The people sampled must be sophisticated enough to know how to install the toolbar, and they must have some reason to want it. It turns out that the toolbar tells you things about web sites, so it is useful to people in the SEO industry, and so it overrepresents those people.
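To make the mechanism concrete, here is a minimal sketch of how unequal install rates skew a panel. Every number in it is hypothetical (the group sizes, the install rates, and the names `groups`, `population`, `panel` and `share` are made up for illustration and are not Alexa data); it only shows that a small group can dominate a sample when it is far more likely to opt in.

```python
# Hypothetical population: a small SEO-minded group and everyone else.
# Group sizes and install rates are invented for illustration only.
groups = {
    "seo": {"users": 1_000, "install_rate": 0.20},
    "general": {"users": 99_000, "install_rate": 0.002},
}


def share(counts, key):
    """Fraction of the total that belongs to the given group."""
    return counts[key] / sum(counts.values())


# Everyone on the internet (in this toy model).
population = {name: g["users"] for name, g in groups.items()}

# Expected number of toolbar installs per group: the self-selected panel.
panel = {name: g["users"] * g["install_rate"] for name, g in groups.items()}

print(f"SEO share of all users:     {share(population, 'seo'):.1%}")
print(f"SEO share of toolbar panel: {share(panel, 'seo'):.1%}")
```

With these made-up rates, a group that is 1% of all users ends up being roughly half of the panel, so any site that group favors will look far more popular than it is.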
For example, let's look at the log stats for my site and for some of my friends who have recently published their stats for 2006. We list the actual number of visits and pageviews, and the Alexa numbers for reach and pageviews. The difference is quite profound. For example, I get about twice the pageviews of mattcutts.com, but his Alexa pageview figure is about 25 times higher than mine (I got this by looking at the 1 year, most highly smoothed graph, and then squinting to guess at the mean). What that means is that people with the Alexa toolbar installed are 25 times more likely to view a page on Matt's site than on mine, while users overall view twice as many pages on my site. That's a 50 to 1 difference (25 times 2) introduced by the selection bias of Alexa. Presumably this is because Matt's site is really appealing to a core group of SEO enthusiasts, many of whom also like the Alexa toolbar.
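The 50 to 1 figure can be checked with a few lines of arithmetic. This is only a sketch using the rough ratios quoted above (about 2x and about 25x, the latter estimated by eye from a graph), and the function name `selection_bias_factor` is a hypothetical helper, not anything from Alexa or the original article.

```python
def selection_bias_factor(true_ratio, sampled_ratio):
    """How many times the sampled A:B ratio overstates the true A:B ratio."""
    return sampled_ratio / true_ratio


# Both ratios are expressed as mattcutts.com : norvig.com.
true_pageview_ratio = 1 / 2    # server logs: Matt gets about half the pageviews
alexa_pageview_ratio = 25 / 1  # Alexa panel: Matt appears to get about 25x

bias = selection_bias_factor(true_pageview_ratio, alexa_pageview_ratio)
print(f"Alexa overstates Matt's site relative to mine by about {bias:.0f}x")
```

Dividing the sampled ratio by the true ratio is just one way to express the distortion; since both inputs are eyeballed estimates, the result should be read as an order of magnitude rather than a precise measurement.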
Keep that in mind the next time you see a statistic on web usage (or any statistic): the results are only as good as the selection process that brings in the data.
...read more and view data (http://norvig.com/logs-alexa.html)