I think Google's duplicate content penalty is over-rated. Re the example with EB Games and Gamespot, I have seen this type of thing a lot, and it's almost NEVER penalized. Reason - they are established, reputable, large sites. Where it DOES occur is on blogs, article sites etc. which have the same articles posted. Generally it's not so much Google penalizing a site specifically for having this type of content; rather, it sees the two lots of content as being the same and just picks one to index. The sites NOT indexed will usually see it as a penalty, but really it's just Google recognizing the duplicate content and indexing only one copy.

Matt
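To illustrate what "recognizing the duplicate content and indexing only one copy" might look like in principle, here's a rough sketch using word shingles and Jaccard similarity. This is NOT Google's actual algorithm - the URLs, page text and 0.9 threshold are all made up for illustration:

```python
# Toy near-duplicate detection: word shingles + Jaccard similarity.
# Everything here (URLs, text, threshold) is hypothetical.

def shingles(text, k=5):
    """Return the set of k-word shingles for a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets, 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

pages = {
    "ebgames.example/review": "The game delivers stunning visuals and tight controls throughout",
    "gamespot.example/review": "The game delivers stunning visuals and tight controls throughout",
}

sets = {url: shingles(text) for url, text in pages.items()}
urls = list(sets)
similarity = jaccard(sets[urls[0]], sets[urls[1]])

if similarity > 0.9:
    # Treat the pages as duplicates and keep only one in the index --
    # arbitrarily the first URL here; a real engine would use other signals.
    canonical = urls[0]
    print(f"Duplicate cluster (similarity {similarity:.2f}); indexing only {canonical}")
```

The point is just that nothing in a scheme like this "punishes" either site - one copy simply wins the slot in the index.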
I'm running a related experiment. It seems like the amount of incoming PageRank a website has determines the maximum number of pages that get indexed - page rank experiment
That's a nice experiment. Don't forget to let us know if the pages get de-indexed (or lose rank) in a couple weeks.
I wasn't talking about the date on the site. I AM talking about the date that Google first indexed the content on each of the sites. They are not going to use index date either, because like I said, different sites have different crawl frequencies. Some only get crawled once per month or so, others get crawled every two weeks, some get crawled once a week, and others get crawled constantly - 24x7. So it would be VERY unfair to those sites that only get crawled once per month, or once every week or two, to use the index date/time to distinguish duplicates. They would never get credit for their content if other sites which get crawled more frequently were copying and republishing it on their sites.

Here's a video of Cutts talking about it at SMX a couple years ago. He specifically mentions how they have to take crawl rate into account when coming up with a solution for figuring out the original from duplicates, to prevent smart blackhats from claiming content from sites that are crawled infrequently. He mentions how dups aren't so much a problem w/ blogs because of pings... but they are w/ traditional web sites. He also mentions adding a link in RSS feeds back to the original version on your site to help them distinguish the original from dups.
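To make the crawl-rate problem concrete, here's a toy calculation (the sites, dates and crawl intervals are all made up, and this obviously isn't how Google models it): a site crawled once a month publishes an article, a scraper that gets crawled daily copies it the next day, and the copy shows up in the index weeks before the original.

```python
from datetime import date, timedelta

# Hypothetical illustration: if "first copy crawled" decided who the
# originator was, a frequently crawled scraper would win on schedule alone.

published = date(2010, 3, 1)            # original article goes live
copied = published + timedelta(days=1)  # scraper republishes the next day

original_site_interval = 30             # small site, crawled about once a month
scraper_site_interval = 1               # large scraper, crawled daily

# Assume both sites were last crawled on the publish date, so the next
# crawl of each page is one full interval later.
original_first_crawled = published + timedelta(days=original_site_interval)
scraper_first_crawled = copied + timedelta(days=scraper_site_interval)

print("Original first crawled:", original_first_crawled)  # 2010-03-31
print("Scraper first crawled: ", scraper_first_crawled)   # 2010-03-03
# The copy is indexed weeks before the original, which is exactly why
# crawl/index date alone can't be the deciding signal.
```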
Sorry, I forgot the link. At 2 min in he specifically mentions how they have to consider crawl rates... at about 2:45ish he mentions putting links back to your copy of the content in RSS feeds to help them know who the originator is. Of course, it's 2 years old. Lots can change between then and now. However, it shows that they don't like to implement "checks" for things that can affect rankings (like duplicate content) unless it can be done in a way that is fair and not spammable. And since using the first copy crawled as the originator would be totally unfair to sites that get crawled infrequently, it's very unlikely that is the determining factor.
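For anyone who wants to follow the RSS advice, here's a minimal sketch of a feed item whose link (and guid) point back to the original article on your own site, so scrapers republishing the feed still carry a pointer to the source. The feed title and URLs are placeholders, not anything from this thread:

```python
# Build a minimal RSS 2.0 item that links back to the canonical article URL.
import xml.etree.ElementTree as ET

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example Site Feed"
ET.SubElement(channel, "link").text = "http://www.example.com/"

item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "My Original Article"
# <link> and <guid> both point at the copy hosted on the originating site.
ET.SubElement(item, "link").text = "http://www.example.com/articles/my-original-article"
guid = ET.SubElement(item, "guid", isPermaLink="true")
guid.text = "http://www.example.com/articles/my-original-article"

print(ET.tostring(rss, encoding="unicode"))
```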