In all seriousness, here is a good lesson from this. There is no required stay in the sandbox; it happens because of a faulty (or perhaps reasonable) link strategy. If a site gets a ton of inbound links from different (sub)domains at once, it can bypass the sandbox because it is treated as something different (a new fad or whatever). Another example is the "numa numa" video: it hit Google immediately because it got a ton of inbound links all at once. As did these sites. Tons of inbound links (legitimate or not), instant ranking on Google. I didn't see this dude wondering when the sandbox would release him. So, IMHO, when starting a new site, it makes good sense (maybe not long-term sense, but who knows) to get as many inbound links as possible to avoid any "sandboxing".
There's a website called "turnitin.com" or something like that, so teachers can catch cut-and-paste plagiarism. Google would need to do that on a much larger scale, and faster. When looking for dupes, more credit should be given to the first occurrence of a phrase. This would encourage better writing and discourage scraping. I'm sure I'll see it eventually, at least in my lifetime. I have also seen algorithms that will read some text and estimate the level of education of the writer. I wouldn't be surprised if Google has something like this already, at least on a rudimentary level. Google already checks for profanity, so searching for select grammar mistakes and blatant misspellings would seem feasible too. Google isn't just in the search biz--their biz is promoting the Internet too. That's why they want everyone to have Internet access: more people online = more money. The Internet would be more attractive if, on the whole, people were motivated to write well. Hell, I'll code it myself if I have to. A script to rank the blogosphere by level of education would be disruptive for sure. If you steal my idea, I want credit.
There are a couple of problems with this, but neither is insurmountable. People who write on really technical or niche topics are likely to include a lot of jargon, which a spellchecker would not recognise, so the threshold for misspellings needs to be fairly high. The other problem might be people who declare no language in their HTML, or mix several languages in one document, or who accidentally put in the wrong language encoding.
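To make that threshold idea concrete, here is a rough sketch in Python of a misspelling-rate check with a deliberately high cutoff. The tiny word list and the 30% cutoff are just illustrative assumptions; a real checker would load a full dictionary and merge in domain-specific jargon so niche sites aren't punished unfairly.

    import re

    def misspelling_rate(text, dictionary):
        """Fraction of words not found in the supplied word list."""
        words = re.findall(r"[a-zA-Z']+", text.lower())
        if not words:
            return 0.0
        unknown = sum(1 for w in words if w not in dictionary)
        return unknown / len(words)

    # Toy word list standing in for a real dictionary plus jargon list.
    DICTIONARY = {"googlebot", "crawls", "the", "pages", "and", "indexes", "them"}

    THRESHOLD = 0.30  # deliberately high, per the jargon concern above

    sample = "Googlebot crawls teh pages and indexes them"
    if misspelling_rate(sample, DICTIONARY) > THRESHOLD:
        print("flag page for poor spelling")
    else:
        print("page passes the spelling check")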
This is what I was thinking of--something like a "Readability Test" http://juicystudio.com/services/readability.php?url=
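For what it's worth, scores like the ones that test reports are typically based on simple formulas such as the Flesch-Kincaid grade level. Below is a minimal Python sketch using that standard formula; the syllable counter is a crude heuristic, and this is only an illustration, not how Google or juicystudio actually implement anything.

    import re

    def count_syllables(word):
        """Crude syllable estimate: count runs of vowels, minimum one."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        """Standard Flesch-Kincaid grade-level formula."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[a-zA-Z']+", text)
        if not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * (len(words) / sentences)
                + 11.8 * (syllables / len(words)) - 15.59)

    print(flesch_kincaid_grade("Google already checks for profanity. "
                               "Searching for grammar mistakes seems feasible too."))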
Well, certainly this has to happen... and I guess the Google engineers now have to reconsider their algo and tweak it a little so that we don't have to fear certain spam any more...
Another one: 163,000 sub-domains. http://www.google.com/search?q=site:v1qn.net Er, didn't Google say they were fixing this?!!! For those of you in Rio Linda, that just means banning them as they find them.
Zeezo at least has some content; I think it's a different animal. About makes subdomains too, but it has content as well.
Yep, and mine shows "Results 1 - 10 of about 23,920,000,000 for a. (0.20 seconds)", which is similar to before the site was removed.
Right, that was the first one ever discovered that was a big-time scam. Google took many of the back-to-back results out of the SERPs but left the sites indexed. I really don't need to get into it because everyone who was here then remembers the story!
Back to the "Bad Data Push" statement last weekend. I have noticed a decrease in the number of pages from my websites that were corrupted up until that "push". They are now coming out of supplemental status as they are recrawled by Googlebot. I seem to have had a ton of pages with errors in the cached version of the page title. Sorry if this has already been covered in this thread. I've been working on other projects all week and don't have time to read this entire HUGE thread tonight.
I am slowly - VERY slowly - seeing some recovery too, Mark. Over the past few days, some of those non-existent pages are dropping out and real currently existing pages are being added. I don't know what they did but somebody is NOT going to get his Christmas bonus at the Googleplex this year.
I'm glad to hear that I'm not the only one seeing improvement. When I say improvement, I'm talking about an additional 50 pages or so out of 1,000, but at least it is progress. I think Google had to push that bad data back into the index before Googlebot could recrawl it. More than likely, when their software detects errors it removes those pages from the index. I'll bet they had millions of pages with similar errors on them. These errors were not the fault of webmasters, but of how Google fetched those pages and stored them to begin with. I guess the good news is that Googlebot is extremely fast at crawling, and I hope that things will get back to normal before too much longer.
Someone else easily found five million SPAM pages!! http://www.digg.com/technology/5_Mi..._a_Couple_Hours_That_Google_Missed_All_Week_2
The real issue is actually that they had so MANY pages indexed. There are thousands of sites like that sneaking under the radar with 50-75,000 pages indexed in one domain, and they may have hundreds of those domains in the index. I have reported a lot of them because they have scraped content from my sites, but Google doesn't give a s**t about it. All they care about is being the leading SE with the most pages so it looks good, and being able to show their ads on spam sites and make some $$$. I don't understand WTF they are thinking when original sites with good content are deleted but sites with scraped content from my site are still there?? How weird is that???