I intend this guide to help people who've had their site hijacked from the Google SERPs. (Many people just assume that their site has a penalty and cannot figure out why.)

A quick primer on the different types of proxies out there:
-Good type: robots.txt disallows search engines from crawling any proxied content.
-Semi-bad type: No robots.txt, but doesn't strip "no robots" meta tags or cache content.
-Bad type: No robots.txt, strips "no robots" meta tags and caches content.

The bad and semi-bad proxies are harmful to your Google rankings if they succeed in tricking Google into believing that they really "own" your content. Yes, it can and does happen, so beware!

Why do people do this? Most proxies are set up for the money that comes from displaying ads on the proxied pages. The bad proxies that manage to steal your rankings might list for some long-tail keywords, though the traffic levels are probably low, since most of the proxied pages have lower PageRank (or none at all) than the original page. The result is a mess in the search engine results. Seriously! I can understand why I see so many people selling their black-hat proxy sites. (They probably got banned in Google after getting caught, maybe even got a shutdown notice from their ISP.) BH proxy owners, be sure to ask yourself this: "Would I be annoyed if my site was jacked from the SERPs?" The loss of rankings can be quite rough for the legitimate webmaster.

I've compiled a list of things that can be done to combat these proxies and prevent them from stealing and hijacking your content.

1). Search for copies of text from your site. Use full quotes to get listings of sites which have copied your content.
2). Check each site to see if it is really copying all of your content or just a snippet.
3). If the site is a proxy, determine its IP address. (You can do this by pinging the hostname, or even by typing www.whatismyip.com into the proxy to get the real IP behind the scraper.)
4). Compile a list of the proxies and their IP addresses. You will need this list to defend your site.
5). You will need some form of includes enabled on your site: make a script that compares the requesting IP address to your list of IP addresses and, on a match, returns the following (a sketch of such a script is posted after this list):

<meta name="robots" content="noindex,nofollow" />

6). View the copy of your site on each proxy you've identified and check that it now carries the noindex code.
7). Some proxies will play fair, and you will then be able to delist the copies via Google's removal tool: https://www.google.com/webmasters/tools/removals Be sure to select "Information or image that appears in the Google search results.", then "The site owner has removed this page/image from the web or blocked it from being indexed", then enter your URL on the last page and hit submit. Steps 5 to 7 ensure that your content is removed from Google quickly.
8). For any proxies that strip your "noindex" meta tag, add them to your 403 list.
9). When you set up your 403 list, import those IP addresses into your .htaccess file to block the content outright.
10). You may notice that some proxies revert to a cached version of your page! In this case, take down the IP addresses of these proxies so that you can follow up.
11). When it comes to dealing with the proxies themselves, you can try to email them directly, but getting a response from them is not common. It doesn't hurt to try! If all else fails, you may want to politely inform their upstream provider.
(Do a whois search on their IP address at http://arin.net to see who manages their netblock.) The abuse department is usually the best contact.
12). Be clear with your requests to any upstream provider. NOC staff are busy, heavily worked people who get too many complaints from total noobs about things that are really non-issues. Use as few words as possible, mention that the proxy site in question is honoring neither your robots tags nor your 403 Forbidden responses, and be sure to explain the difference between this proxy and a normal proxy. Always be polite... I can't stress this enough. I used to work at an ISP, and dealing with rude and belligerent requests gets really annoying. Remember folks, NOC personnel are people too. (NOC = Network Operations Center.)
13). Follow up requests for further information, or even initial denials, with a polite summary and emphasis of your problem. Some NOCs are too busy to fully read your original email, so they may overlook important details buried in the mess.

So that's a primer on some first steps for anyone who is experiencing this problem. I'll try to add more details and answer any questions people may have. I have also compiled a list of IPs to "noindex" and "403", which I will post later. I also welcome any feedback.
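To make step 5 a little more concrete, here is a minimal sketch in Python (standard library only). It assumes you keep your list in a plain-text file, one entry per line in the same "noindex <ip>" / "deny <ip>" format as the list I post below. The file name proxy_ips.txt is just a placeholder, and you would wire the output into whatever include or templating mechanism your site actually uses:

import os

def load_proxy_list(path="proxy_ips.txt"):
    # Parse lines like "noindex 72.232.177.43" or "deny 77.232.66.109".
    actions = {}
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[0] in ("noindex", "deny"):
                    actions[parts[1]] = parts[0]
    except IOError:
        pass  # no list yet, so treat every visitor as a normal client
    return actions

def robots_meta_for(remote_ip, actions):
    # Extra <head> markup to inject for this request (step 5), if any.
    if actions.get(remote_ip) == "noindex":
        return '<meta name="robots" content="noindex,nofollow" />'
    return ""

def should_block(remote_ip, actions):
    # True if this IP belongs on the 403 list (step 8).
    return actions.get(remote_ip) == "deny"

if __name__ == "__main__":
    # In a CGI-style environment the requesting IP usually arrives in REMOTE_ADDR.
    actions = load_proxy_list()
    ip = os.environ.get("REMOTE_ADDR", "")
    if should_block(ip, actions):
        print("Status: 403 Forbidden")
        print()
    else:
        print(robots_meta_for(ip, actions))

The point is simply that the check is one dictionary lookup per request, so it adds almost no overhead once the list is loaded.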
Here's the list so far.

*I have processed these by hand.
*Two or three of these have already been suspended by their ISP.
*IP address delegations can change. Keep a date associated with your IP address list and review it occasionally for freshness.
*Do an ARIN/RIPE/APNIC lookup on each IP to ensure that you are not blocking Yahoo or Google etc. (unlikely, since I looked these up).
*Sorry, I haven't been storing the proxy/scraper site along with the IPs. Try doing a reverse lookup, or just surf to the IP.
*Feel free to post your own here, but keep in mind: a blacklist is only good as long as it is fresh and the false positives are kept to a minimum.

(Sorry for the bad formatting, I'm just copying this output from my server.)

noindex: generally the scraper responds to insertion of the meta-noindex tag. However, there are a few "noindex" records that I have missed putting a deny tag on.
deny: just 403 these proxies, they are evil and ignore all tags, no-cache, etc.
Tip: you may want to send garbage bits or garbage text to the deny records instead of a 403. (If they get a 403, they will just serve a cached copy of your page; by sending garbage, they will index and proxy garbage to the search engines.) Do this if the site already has a cache of your site.

noindex 72.232.177.43
noindex 65.98.59.210
noindex 70.84.106.153
noindex 64.209.134.9
noindex 77.232.66.139
deny 77.232.66.109
noindex 66.246.246.50
deny 66.90.104.77
noindex 85.12.25.70
deny 65.98.59.218
noindex 64.85.160.59
noindex 66.90.77.2
noindex 74.86.43.151
deny 208.100.20.148
noindex 66.96.95.200
deny 208.109.78.122
deny 72.249.119.71
deny 202.131.89.100
deny 217.16.16.204
deny 217.16.16.215
deny 83.222.23.243
deny 76.163.209.36
deny 217.16.16.218
deny 83.222.23.200
deny 83.222.23.214
deny 83.222.23.202
deny 66.232.126.136
deny 82.146.62.107
deny 66.232.126.134
noindex 70.84.106.146
deny 78.129.131.123
deny 70.87.207.234
deny 76.162.253.46
deny 64.72.126.8
noindex 216.86.152.216
deny 85.17.58.183
deny 216.129.107.106
deny 72.36.145.138
deny 64.34.166.94
noindex 70.183.59.7
noindex 66.90.73.236
noindex 69.59.22.27
deny 74.86.61.234
deny 69.59.28.145
noindex 193.138.206.207
noindex 205.209.146.144
noindex 203.146.251.107
noindex 67.159.44.136
noindex 85.92.130.117
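For step 9, if you keep the list in that same file format, you can generate the .htaccess block from the "deny" entries rather than typing them by hand. A small sketch, again in Python with the placeholder file name proxy_ips.txt, emitting the classic Apache 2.2 access-control syntax (adjust if your server is configured differently):

def htaccess_deny_block(list_path="proxy_ips.txt"):
    # Collect the IPs marked "deny" and wrap them in an Apache 2.2 block.
    ips = []
    with open(list_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "deny":
                ips.append(parts[1])
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += ["Deny from %s" % ip for ip in ips]
    return "\n".join(lines)

if __name__ == "__main__":
    print(htaccess_deny_block())

Run it whenever the list changes and paste the output into your .htaccess.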
conradbarrett, I've sent a PM to you with my MSN... However, just a note: I'm rarely on MSN. Please just email or PM me.
By the way, conradbarrett, I've noticed that the site in your signature has some content that seems to be the same as the description of the Heroes show on Wikipedia. It might be best to write a unique description in your own words. (Copying content snippets is also a way to devalue your site.) As a rule of thumb, never copy content unless you are given express permission to do so, and then follow the terms that come with it. Just a reminder: if you ever feel that you have a duplicate content penalty, first check that you have not copied any snippets from around the web. If another site says it better than you can, just link to them.
Thanks for the info! I have not had any issues with this yet, but I am sure this list will help keep most of those problems away. Thanks
Seems pretty easy, doesn't it? ...Well, only if the proxies play by the rules (which they don't). For example, if they pass on the client's user agent (like IE or Firefox), it is difficult to determine that the request is coming from a proxy, and trying to detect it on the fly is very process-intensive if your site is under any load. From a programmer's standpoint, a good start is a well-maintained list of badly behaving proxies. Skim the SE listings daily, or put a Google Alert on phrases from your content, to find copies of your site. If anyone has any suggestions on automating this bad-proxy-discovery process, I'm all ears! (A rough half-automated sketch is below.)
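I don't have a fully automated answer either, but the checking half of it (steps 2 to 4 in the first post) is easy to script. A rough sketch in Python: feed it the suspect URLs you collected by hand from quoted-phrase searches, and it tests each one for a sentence that only exists on your site and resolves the host's IP for your list. The sentinel phrase and the example URLs below are placeholders:

import socket
import urllib.request
from urllib.parse import urlparse

# Placeholders: a unique sentence from one of your own pages, plus suspect URLs
# collected from quoted-phrase searches.
SENTINEL = "a distinctive sentence copied verbatim from your own page"
SUSPECTS = [
    "http://example-proxy-one.invalid/index.php?q=http://www.yoursite.com/",
    "http://example-proxy-two.invalid/browse.php?u=http://www.yoursite.com/",
]

def check_suspect(url):
    # Return (is_copy, ip) for one suspect URL.
    host = urlparse(url).hostname
    try:
        ip = socket.gethostbyname(host)
    except socket.error:
        ip = "unresolved"
    try:
        page = urllib.request.urlopen(url, timeout=15).read().decode("utf-8", "replace")
    except Exception:
        return (False, ip)
    return (SENTINEL.lower() in page.lower(), ip)

if __name__ == "__main__":
    for url in SUSPECTS:
        is_copy, ip = check_suspect(url)
        if is_copy:
            print("COPY FOUND: %s (IP %s)" % (url, ip))

It won't discover new proxies on its own, but it does turn "check each site and note its IP" into a one-command job once you have candidates.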
Remember: play nicely or someone may send a notice to your provider. Seriously though, if you want your proxies to stay legitimate and functioning, you should honor the no-robots meta tags. Going as far as robots-blocking the proxied content yourself would be both noble and would really solidify your proxy. Grumpy webmasters with hijacked content can cause more trouble than just a banned proxy. (Not all webmasters have the resources to meta-block the proxied content, but they still don't want to be scraped by proxies.)
Most of the proxy sites are removing it. People must always check their websites against the proxy cheaters.
One thing that might help (and doesn't require much noodle) is to ask the authors of the popular proxy scripts to include a barebones robots.txt with the script (example below). I assume the majority of proxy owners are unknowingly aiding those with malicious intent simply by not being aware.
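For reference, the barebones robots.txt in question only needs two lines to tell well-behaved spiders to stay out of everything the proxy serves:

User-agent: *
Disallow: /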
I totally agree with you. http://forums.digitalpoint.com/showthread.php?t=500687 contains my solution for the proxy cheater.
Interesting solution. It would be good to run this only once per IP address, since multiple tests on the same IP address would waste time. I'm not sure that checking the IP address for a responsive port 80 is always the best test: what if the visitor runs a personal webserver that is not a proxy, or is browsing from a workplace whose company website responds on their IP address? (Rare, but it's a case to consider.)

Update: There is a case where a proxy operator shows the no-robots meta tag to normal clients but strips it when Google spiders the page. You may discover, using Google's removal tool ( https://www.google.com/webmasters/tools/removals ), that their site does not get removed because they stripped the noindex tag. You can also check by viewing the site through Google Translate (do a Spanish-to-English translation). In this case, be sure to set the offender's IP to 403 Forbidden and, if needed, contact the IP netblock owner. Basically, if the proxy owner is going out of his/her way to scrape and rank with your content, it's best to just get the proxy shut down. Remember to be polite and courteous for the best and fastest results. (A quick user-agent test is sketched below.)
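If you want to test for that kind of stripping yourself, a quick (and admittedly imperfect) check is to fetch the proxied copy of one of your pages with a normal browser User-Agent and again with a Googlebot-style one, then compare whether your injected noindex tag survives. A sketch in Python; the proxied URL is a placeholder, and keep in mind a proxy doing real cloaking may key off the crawler's IP range rather than the User-Agent, so a clean result here is only a hint:

import urllib.request

# Placeholder: the URL of your page as served through the suspect proxy.
PROXIED_URL = "http://suspect-proxy.invalid/browse.php?u=http://www.yoursite.com/"
NOINDEX = 'content="noindex'
AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for name, ua in AGENTS.items():
    req = urllib.request.Request(PROXIED_URL, headers={"User-Agent": ua})
    try:
        body = urllib.request.urlopen(req, timeout=15).read().decode("utf-8", "replace")
    except Exception as exc:
        print("%s fetch failed: %s" % (name, exc))
        continue
    kept = NOINDEX in body.lower()
    print("%s UA: noindex tag %s" % (name, "kept" if kept else "STRIPPED"))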
Thanks for the information, gostats. I read an article a few months ago about a guy who had one of his clients' sites deindexed and its content basically stolen by a proxy. I posted a link to this thread on a few other forums because webmasters should be made aware of this information.
Yes, another thing I should mention: it is likely that the client you mentioned did not actually have his content de-indexed. Rather, it was buried far down in the SERPs and was likely only visible after clicking "repeat the search with ....", since that removes a filter. That is the nature of this "bug", and it can get very frustrating for the honest webmaster. However, I'm confident we can vaccinate most sites against these effects. More awareness of this problem in the web-hosting industry would help a lot too.
If you have gone through all of the above options and you still have a proxy that is ignoring your noindex tag, your 403s and even your emails about the problem (and the NOC is asleep), it may be time to send a DMCA notice to Google; see http://www.google.com/dmca.html. The DMCA notice is a sort of last-resort option for protecting your content from the ranking thieves. If a proxy operator is careless enough to ignore your legitimate requests (both robotic and human), it is time to take further action. Make sure you fill out your DMCA report fully and double-check it for any errors or omissions.