In all seriousness, here is a good lesson from this. There is no required stay in the sandbox; it happens because of a faulty (or perhaps reasonable) link strategy. If a site gets a ton of inbound links from different (sub)domains at once, it can bypass the sandbox because it is treated as something different (a new fad or whatever). Another example is the "numa numa" video: it hit Google immediately because it got a ton of inbound links all at once. As did these sites. Tons of inbound links (legitimate or not), instant ranking on Google. I didn't see this dude wondering when the sandbox would release him. So, IMHO, when starting a new site, it makes good sense (maybe not long-term sense, but who knows) to get as many inbound links as possible to avoid any "sandboxing".
There's a website called "turnitin.com" or something like that, so teachers can catch cut-and-paste plagiarism. Google would need to do that on a much larger scale, and faster. When looking for dupes, more credit should be given to the first occurrence of a phrase. This would encourage better writing and discourage scraping. I'm sure I'll see it eventually, at least in my lifetime. I have also seen algorithms that will read some text and estimate the level of education of the writer. I wouldn't be surprised if Google has something like this already, at least on a rudimentary level. Google already checks for profanity, so searching for select grammar mistakes and blatant misspellings would seem feasible too. Google isn't just in the search biz--their biz is promoting the Internet too. That's why they want everyone to have Internet access: more people online = more money. The Internet would be more attractive if, on the whole, people were motivated to write well. Hell, I'll code it myself if I have to. A script to rank the blogosphere by level of education would be disruptive for sure. If you steal my idea, I want credit.
There are a couple of problems with this, but neither is insurmountable. People who write on really technical or niche topics are likely to include a lot of jargon, which a spellchecker would not recognise, so the threshold for misspellings needs to be fairly high. The other problem might be people who declare no language in their HTML, or mix several languages in one document, or who accidentally put in the wrong language encoding.
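To make that threshold idea concrete, here is a rough sketch in Python of a misspelling-rate check with a deliberately high cutoff. The tiny word list and the 30% cutoff are just illustrative assumptions; a real checker would load a full dictionary and merge in domain-specific jargon so niche sites aren't punished unfairly.

    import re

    def misspelling_rate(text, dictionary):
        """Fraction of words not found in the supplied word list."""
        words = re.findall(r"[a-zA-Z']+", text.lower())
        if not words:
            return 0.0
        unknown = sum(1 for w in words if w not in dictionary)
        return unknown / len(words)

    # Toy word list standing in for a real dictionary plus jargon list.
    DICTIONARY = {"googlebot", "crawls", "the", "pages", "and", "indexes", "them"}

    THRESHOLD = 0.30  # deliberately high, per the jargon concern above

    sample = "Googlebot crawls teh pages and indexes them"
    if misspelling_rate(sample, DICTIONARY) > THRESHOLD:
        print("flag page for poor spelling")
    else:
        print("page passes the spelling check")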
This is what I was thinking of--something like a "Readability Test" http://juicystudio.com/services/readability.php?url=
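For what it's worth, scores like the ones that test reports are typically based on simple formulas such as the Flesch-Kincaid grade level. Below is a minimal Python sketch using that standard formula; the syllable counter is a crude heuristic, and this is only an illustration, not how Google or juicystudio actually implement anything.

    import re

    def count_syllables(word):
        """Crude syllable estimate: count runs of vowels, minimum one."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        """Standard Flesch-Kincaid grade-level formula."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[a-zA-Z']+", text)
        if not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * (len(words) / sentences)
                + 11.8 * (syllables / len(words)) - 15.59)

    print(flesch_kincaid_grade("Google already checks for profanity. "
                               "Searching for grammar mistakes seems feasible too."))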
Well, certainly this has to happen... and I guess the Google engineers now have to reconsider their algo and tweak it a little so that we don't have to fear certain spam any more...
Another one: 163,000 sub-domains. http://www.google.com/search?q=site:v1qn.net Er, didn't Google say they were fixing this?!!! For those of you in Rio Linda, that just means banning them as they find them.
Zeezo at least has some content; I think it's a different animal. About makes subdomains too, but it has content as well.
Yep, and mine shows "Results 1 - 10 of about 23,920,000,000 for a. (0.20 seconds)", which is similar to before the site was removed.
Right, that was the first one ever discovered that was a big-time scam. Google took many of the back-to-back results out of the SERPs but left the sites indexed. I really don't need to get into it because everyone who was here then remembers the story!
Back to the "Bad Data Push" statement last weekend. I have noticed a decrease in the number of pages from my websites that were corrupted up until that "push". They are now coming out of supplemental status as they are recrawled by Googlebot. I seem to have had a ton of pages with errors in the cached version of the page title. Sorry if this has already been covered in this thread. I've been working on other projects all week and don't have time to read this entire HUGE thread tonight.
I am slowly - VERY slowly - seeing some recovery too, Mark. Over the past few days, some of those non-existent pages are dropping out and real currently existing pages are being added. I don't know what they did but somebody is NOT going to get his Christmas bonus at the Googleplex this year.
I'm glad to hear that I'm not the only one seeing improvement. When I say improvement, I'm talking about an additional 50 pages or so out of 1,000, but at least it is progress. I think Google had to push that bad data back into the index before Googlebot could recrawl it. More than likely, when their software detects errors it removes those pages from the index. I'll bet they had millions of pages with similar errors on them. These errors were not the fault of webmasters, but of how Google fetched those pages and stored them to begin with. I guess the good news is that Googlebot is extremely fast at crawling, and I hope that things will get back to normal before too much longer.
Someone else easily found five million SPAM pages!! http://www.digg.com/technology/5_Mi..._a_Couple_Hours_That_Google_Missed_All_Week_2
The real issue is actually that they had so MANY pages indexed. There are thousands of sites like that sneaking under the radar with 50-75,000 pages indexed in one domain, and they may have hundreds of those domains in the index. I have reported a lot of them because they have scraped content from my sites, but Google doesn't give a s**t about it. All they care about is being the leading SE with the most pages so it looks good, and being able to show their ads on spam sites and make some $$$. I don't understand WTF they are thinking when original sites with good content are deleted but sites with scraped content from my site are still there?? How weird is that???