I just had a thought after stumbling across one of the many Wikipedia clones on the web. Isn't Google supposed to penalize duplicate content? If so, why are there so many sites like www.informationblast.com, www.answers.com and www.biocrawler.com? All found in less than a minute of searching. Is it because Wikipedia's content changes quickly and these sites are holding archives of it? Even so, wouldn't large sections appear to be plagiarised?
Yeah - when I was researching for a uni project, Answers came up a lot. It was interesting to see that it was duplicate content from Wikipedia, yet the Answers site was ranking higher in the results!
Doesn't Wikipedia have index restrictions? I remember Google was dying to index all of Wikipedia but was blocked until only recently. Those folks at answers.com must have had a hard time copying & pasting all those articles.
Wikipedia's robots.txt just sets Crawl-delay: 1 - Google is not blocked. Wikipedia's database dumps are published at http://download.wikimedia.org/ (Oh noes! I've probably just fostered the creation of another Wikipedia clone!)
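If you want to check this yourself, here's a rough Python sketch using the standard-library urllib.robotparser. The exact robots.txt URL and user-agent string are just illustrative, and the printed crawl delay depends on whatever the live robots.txt happens to say:

```python
import urllib.robotparser

# Read Wikipedia's robots.txt and see what it actually allows.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# Google is not blocked from the main article namespace.
print(rp.can_fetch("Googlebot", "https://en.wikipedia.org/wiki/Main_Page"))

# Prints the Crawl-delay value for this agent, or None if none is set.
print(rp.crawl_delay("Googlebot"))
```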
Yes, but even G's processing power is finite. Think of how many pages there are on the web, then think of the processing power it would take to compare one page against the rest of the web. Now multiply that by the number of pages on the web and you can see how expensive it gets. Duplicate content is primarily penalized when it is within the same site; it is rarely checked from site to site, and even then only for a handful of sites.
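To put some rough numbers on it: comparing every page against every other page is on the order of n*(n-1)/2 comparisons, which is hopeless at web scale, whereas spotting exact duplicates within a single site can be done in one pass by fingerprinting each page. Here's a toy Python sketch of that idea (the page texts are made up, and real engines use fuzzier fingerprints such as shingling rather than a plain MD5 of the text):

```python
import hashlib
from collections import defaultdict

# Toy within-site duplicate check: one fingerprint per page, single pass,
# instead of comparing every page against every other page.
pages = {
    "/about": "We sell widgets. Contact us for a quote.",
    "/about-us": "We sell widgets. Contact us for a quote.",   # duplicate
    "/products": "Our widget catalogue, updated weekly.",
}

groups = defaultdict(list)
for url, text in pages.items():
    # Normalise whitespace and case, then hash the page text.
    fingerprint = hashlib.md5(" ".join(text.lower().split()).encode()).hexdigest()
    groups[fingerprint].append(url)

for fingerprint, urls in groups.items():
    if len(urls) > 1:
        print("Duplicate content:", urls)
```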
Sorry, I jumped the gun here without proper knowledge. Would I be right in saying that Google offered to host Wikipedia on their own servers? I was also wrong about answers.com copying & pasting - see http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts