This is following on from the thread about billions of spam pages in Google's index: http://forums.digitalpoint.com/showthread.php?t=97090 I think it might be helpful to split this topic off rather than hijacking an already-massive thread.

Just to recap some good ideas people had:

- Stringerbell suggested Google do a manual check on the top 1000 AdSense publishers, rated according to their "Google juice" (i.e., the overall best placed), and on the top 1000 sites by indexed pages.
- SVZ suggested a manual check on sites that reach over 1 million pages indexed.
- I suggested an algorithm that kicks in for sites with lots of pages, looking for misspellings, poor grammar, and stop words as a percentage of the overall text; plus an advanced dupe-content filter that strips out the most frequent keywords.
- pjbrunet mentioned dupe-content checkers like Turnitin, and the readability tests you can run to figure out the grade level of a text.

Here's an extract of some spam from one of the sites mentioned in that thread, to give you some idea of what Google is up against. The random text is sliced oddly, as though it came from someone's porn ebook with every fourth word missing, or some such. I'm not sure any of the methods mentioned so far would catch it with an algorithm, except perhaps a grammar checker.
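To make the stop-word idea concrete, here's a minimal sketch in Python. The stop-word list and the 0.2 threshold are my own illustrative guesses, not anything Google actually uses; the point is just that keyword-stuffed pages tend to have far fewer stop words than real prose.

```python
# Illustrative stop-word-ratio check for keyword-stuffed spam.
# Stop-word list and threshold are made-up examples, not real values.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
              "that", "for", "on", "with", "as", "was", "at", "by"}

def stop_word_ratio(text: str) -> float:
    """Fraction of words in the text that are common stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOP_WORDS for w in words) / len(words)

def looks_like_keyword_spam(text: str, threshold: float = 0.2) -> bool:
    # Natural English prose usually runs well above 20% stop words;
    # keyword-stuffed pages often fall far below that.
    return stop_word_ratio(text) < threshold
```

So a page that's just "hilton head rental hilton head timeshare cheap deals" repeated would score near zero and get flagged, while normal sentences sail through.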
As long as Googlebot remains an automated crawler, there's nothing they can do about the spam. Blackhat SEO is a cat-and-mouse game and Google is always playing catch-up. As for checking the top 1000 AdSense publishers, that won't really do much, since spam sites don't need to be in the top 1000 to make serious money, or even to use AdSense at all. The same goes for the number of pages indexed: most people who do blackhat SEO would set up 1,000 domains with 10,000 pages each instead of 1 domain with 1 million pages.
The irony of your keyword choice in this post floored me. I am currently optimizing for Hilton Head Rental / Hilton Head Timeshare.

Edited to add: I'm doing it the right way, though...
I got it from one of the spam sites mentioned in the "billion pages" thread; it was one of the first pages to come up on a site: query. Actually, I thought that was a semi-nonsense set of keywords, and not something you would actually optimise for. Shows what I know.

True. But they have to at least try, or else they might as well shut up shop and start building a directory. All it would do is remove the most prominent spam sites and ever so slightly increase costs for blackhats. So, more of a public relations exercise than anything else.

I'm beginning to think that the survival of search engines will hinge on just one question: is there any way for a machine to distinguish grammatically correct but random nonsense from real information?
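On the dupe-content side at least, checkers like Turnitin (mentioned at the start) are generally understood to build on word shingling. Here's a rough sketch; the 4-word shingle size and 0.5 overlap threshold are arbitrary choices for illustration, not any real tool's settings.

```python
# Rough sketch of shingle-based near-duplicate detection.
# Shingle size (4 words) and threshold (0.5) are illustrative only.
def shingles(text: str, k: int = 4) -> set:
    """All consecutive k-word sequences in the text, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str) -> float:
    # Jaccard similarity of the two documents' shingle sets.
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    return resemblance(a, b) >= threshold
```

The catch, as the spam extract shows, is that slicing text and dropping every fourth word breaks up the shingles, which is probably exactly why the spammers do it.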
And you should just check some major travel sites: you'll find them scoring in the top 10 in Google while they have long lists of spammy links... keyword-repeating links and lists. Yet Google kicked out lots of ethical sites... It seems like the small businesses, small sites, and personal sites especially get hit, while the big ones aren't that affected.

Now I wonder if your post here could be considered "spam" by Google? Is Google capable of telling that we're just talking about spam here and that you're not actually spamming? I guess not...

I guess to fight spam better they have to look at the visitor metrics. Signs of spam in Analytics might be:
- very high bounce rate
- very low avg. page views
- very few returning visitors
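As a thought experiment, those three signals could be combined into a crude rule-of-thumb check. The thresholds below are made up for the example; real spam detection would obviously need many more signals than this.

```python
# Illustrative rule-of-thumb over the three Analytics signals above.
# All thresholds are invented for the example.
def looks_like_spam(bounce_rate: float,
                    avg_page_views: float,
                    returning_visitor_share: float) -> bool:
    """Flag a site only when all three spam signals fire at once."""
    return (bounce_rate > 0.9              # nearly everyone leaves at once
            and avg_page_views < 1.2       # almost nobody clicks deeper
            and returning_visitor_share < 0.02)  # almost nobody comes back
```

Requiring all three to fire at once keeps legitimate one-page reference sites (which can have high bounce rates on their own) from being flagged.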
Until this year, that would probably have been about right. It's no longer true, though. Their systems have now gotten close to the level of spam detection a human visitor could manage, and that's enough to weed out 99% of the crap. The other 1% will get spotted through reports, suspicious AdSense activity, or unnatural link profiles. Blackhat... R.I.P.
Google is certainly getting smarter about that. Previously, spam survived unnoticed for quite a long time, but today it's detected quite quickly. Manual reviews are also playing an important role here, on top of the technological advances and the use of artificial intelligence.
Spammers are getting more and more sophisticated. I'm seeing more and more farmed hyper-spam content that's a step above the usual: more words, better-constructed context, but still spam. I think Google is having a hard time fighting it.
What makes you say that? Do you have a keyword search that results in spammy SERPs? (A keyword search that could be used to make money, obviously).