
Google's war on proxy sites. (why it exists, and why it's important)

Discussion in 'Google' started by gostats, Oct 2, 2007.

  1. #1
    I intend this guide to help people whose sites have been hijacked out of the Google SERPs. (Many people simply assume their site has been penalized and can't figure out why.)

    A quick primer on the different types of proxies out there:
    -Good type: a robots.txt disallows search engines from crawling any proxied content.
    -Semi-bad type: no robots.txt, but it doesn't strip "noindex" meta tags or cache content.
    -Bad type: no robots.txt, strips "noindex" meta tags, and caches content.

    The bad and semi-bad proxies can harm your Google rankings if they succeed in tricking Google into believing that they are the real owners of your content. Yes, it can and does happen, so beware!

    Why do people do this? Most proxies are set up for the money that comes from displaying ads on the proxied pages. The bad proxies that do steal your rankings tend to rank only for long-tail keywords, and the traffic involved is probably low, since most proxied pages have lower PageRank (or none) than the original page. The net result is a mess in the search engine results.

    Seriously, I can see why so many people are selling their black-hat proxy sites (they probably got banned from Google after getting caught, or maybe even received a shutdown notice from their ISP). BH proxy owners, ask yourselves: "Would I be annoyed if my site was knocked out of the SERPs?" The loss of rankings can be quite rough on a legitimate webmaster.

    I've compiled a list of things you can do to combat these proxies and keep them from stealing your content and hijacking your rankings.

    1). Search for copies of text from your site. Use fully quoted phrases to get a list of sites that have copied your content.
    2). Check each site to see whether it is copying all of your content or just a snippet.
    3). If the site is a proxy, determine its IP address. (You can do this by pinging the hostname, or by browsing to www.whatismyip.com through the proxy to reveal the real IP behind the scraper.)
    4). Compile a list of the proxies and their IP addresses. You will need this list to defend your site.
    5). You will need some form of server-side includes enabled on your site: write a script that compares the requesting IP address to your list and, for listed IPs, returns the following (a sketch is given after this list):
    <meta name="robots" content="noindex,nofollow" />
    Code (markup):
    6). View your site through each proxy you've identified and check that the proxied copy now carries the noindex tag.
    7). Some proxies will play fair, and you can then delist the proxied copies through Google's removal tool at https://www.google.com/webmasters/tools/removals. Select "Information or image that appears in the Google search results.", then "The site owner has removed this page/image from the web or blocked it from being indexed", then enter the URL on the last page and hit submit. Steps 5 to 7 get your content removed from Google quickly.
    8). For any proxies that strip your noindex meta tag, add them to a 403 list.
    9). Once your 403 list is set up, import those IPs into your .htaccess file to block them outright (the sketch below also covers this case).
    10). You may notice that some proxies fall back to a cached version of your page! In that case, note those proxies' IP addresses so you can follow up.
    11). When it comes to dealing with the proxies themselves, you can try emailing them directly; a response is uncommon, but it doesn't hurt to try. If all else fails, politely inform their upstream provider (do a whois search on the IP at http://arin.net to see who manages the netblock). The abuse department usually gets the best results.
    12). Be clear and concise with any upstream provider. They are busy, overworked people who get plenty of complaints from clueless users about non-issues. Use as few words as possible, make sure you mention that the proxy in question honors neither your robots meta tags nor your 403 Forbidden responses, and explain how it differs from a normal proxy. Above all, be polite; I can't stress this enough. I used to work at an ISP, and rude, belligerent requests get old fast. Remember, folks, NOC (Network Operations Center) personnel are people too.
    13). Follow up on requests for further information, and even on initial denials, with a polite summary that emphasizes your problem. Some NOCs are too busy to read your original email in full and may overlook important details buried in it.
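
    To make steps 5 and 8-10 concrete, here is a minimal sketch. It is written as a Python WSGI handler purely for illustration (your own setup may just as well be PHP includes or something else); the set names, the placeholder page body, and the sample IPs (taken from the list posted later in this thread) are all stand-ins for your own data.

    # Minimal sketch of steps 5 and 8-10; NOINDEX_IPS, DENY_IPS and the
    # placeholder page body are stand-ins for your own data and templates.

    NOINDEX_IPS = {"72.232.177.43", "65.98.59.210"}   # proxies that still honor meta tags
    DENY_IPS = {"66.90.104.77", "208.100.20.148"}     # proxies that strip them

    ROBOTS_META = '<meta name="robots" content="noindex,nofollow" />'

    def robots_meta_for(remote_ip):
        """Extra <head> markup to emit for this visitor, or an empty string."""
        return ROBOTS_META if remote_ip in NOINDEX_IPS else ""

    def application(environ, start_response):
        """Tiny WSGI front end: 403 the deny-listed proxies, noindex the rest."""
        ip = environ.get("REMOTE_ADDR", "")
        if ip in DENY_IPS:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        head_extra = robots_meta_for(ip)
        page = "<html><head>%s</head><body>your page content here</body></html>" % head_extra
        start_response("200 OK", [("Content-Type", "text/html")])
        return [page.encode("utf-8")]

    If you would rather block at the web-server level as in step 9, the same deny set can instead be exported as "Deny from" lines in your .htaccess file.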

    So that's a primer on the first steps for anyone experiencing this problem. I'll try to add more details and answer any questions people may have. I have also compiled a list of some IPs to "noindex" and "403", which I will post later. Any feedback is welcome.
     
    gostats, Oct 2, 2007 IP
  2. jinnnguyen

    #2
    Yes please, I'm first in line waiting for that compiled list. TIA
     
    jinnnguyen, Oct 2, 2007 IP
  3. conradbarrett

    #3
    Do you have MSN or AIM? I would like to talk to you.

    Thanks,
    Conrad
     
    conradbarrett, Oct 2, 2007 IP
  4. gostats

    #4
    Here's the list so far.
    *I have processed these by hand.
    *Two or three of these have already been suspended by their ISP.
    *IP address delegations can change. Keep a date associated with each entry on your IP list and review them occasionally for freshness.
    *Do an ARIN/RIPE/APNIC lookup on each IP to ensure that you are not blocking a Yahoo or Google crawler, etc. (unlikely, since I looked these up).
    *Sorry, I haven't been storing the proxy/scraper hostname with the IPs. Try a reverse lookup or just surf to the IP.
    *Feel free to post your own here, but keep in mind: "a blacklist is only good as long as it is fresh and the false positives are kept to a minimum."

    (sorry for the bad formatting, I'm just copying this output from my server.)

    noindex: generally the scraper responds to the inserted noindex meta tag. However, there are a few "noindex" records here that I should probably have moved to the deny list.
    deny: just 403 these proxies; they are evil and ignore all tags, no-cache headers, etc.
    -Tip: you may want to send garbage bits or garbage text to the deny records instead of a 403, since if they get a 403 they will just serve a cached copy of your page. By sending garbage, you make them index and proxy garbage to the search engines. *Do this if the proxy already has a cache of your site. (A small sketch for loading this list follows below.)
    
    noindex 	72.232.177.43
    noindex 	65.98.59.210
    noindex 	70.84.106.153
    noindex 	64.209.134.9
    noindex 	77.232.66.139
    deny 	77.232.66.109
    noindex 	66.246.246.50
    deny 	66.90.104.77
    noindex 	85.12.25.70
    deny 	65.98.59.218
    noindex 	64.85.160.59
    noindex 	66.90.77.2
    noindex 	74.86.43.151
    deny 	208.100.20.148
    noindex 	66.96.95.200
    deny 	208.109.78.122
    deny	72.249.119.71
    deny	202.131.89.100
    deny 	217.16.16.204
    deny 	217.16.16.215
    deny 	83.222.23.243
    deny 	76.163.209.36
    deny 	217.16.16.218
    deny 	83.222.23.200
    deny 	83.222.23.214
    deny 	83.222.23.202
    deny 	66.232.126.136
    deny 	82.146.62.107
    deny 	66.232.126.134
    noindex 	70.84.106.146
    deny 	78.129.131.123
    deny 	70.87.207.234
    deny 	76.162.253.46
    deny 	64.72.126.8
    noindex 	216.86.152.216
    deny 	85.17.58.183
    deny 	216.129.107.106
    deny 	72.36.145.138
    deny 	64.34.166.94
    noindex 	70.183.59.7
    noindex 	66.90.73.236
    noindex 	69.59.22.27
    deny 	74.86.61.234
    deny 	69.59.28.145
    noindex 	193.138.206.207
    noindex 	205.209.146.144
    noindex	203.146.251.107
    noindex	67.159.44.136
    noindex	85.92.130.117
    Code (markup):
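
    If it helps, here is a small sketch for loading the two-column list above into the "noindex" and "deny" sets used by the include script from the first post. The file name proxy_ips.txt is made up; the parser simply skips blank or malformed lines.

    # Hypothetical loader for the "action <whitespace> ip" list above.
    def load_proxy_lists(path="proxy_ips.txt"):
        """Return (noindex_ips, deny_ips) parsed from the saved list."""
        noindex_ips, deny_ips = set(), set()
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) != 2:
                    continue                    # skip blank or malformed lines
                action, ip = parts
                if action == "noindex":
                    noindex_ips.add(ip)
                elif action == "deny":
                    deny_ips.add(ip)
        return noindex_ips, deny_ips

    if __name__ == "__main__":
        noindex_ips, deny_ips = load_proxy_lists()
        print(len(noindex_ips), "noindex entries,", len(deny_ips), "deny entries")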
     
    gostats, Oct 2, 2007 IP
  5. gostats

    #5
    conradbarrett, I've sent you a PM with my MSN... just a note, though: I'm rarely on MSN. Please just email or PM me.
     
    gostats, Oct 2, 2007 IP
  6. gostats

    #6
    btw: conradbarrett, I've noticed that the site in your signature has some content that appears to be identical to the description of the Heroes show on Wikipedia. It would be best to write a unique description in your own words. (Copying content snippets is another way to devalue your site.) As a rule of thumb, never copy content unless you are given express permission to do so - and then follow the terms that come with it.

    Just a reminder: if you ever suspect you have a duplicate content penalty, first check that you have not copied any snippets from around the web. If another site says it better than you can, just link to it.
     
    gostats, Oct 2, 2007 IP
  7. speedster

    #7
    good work, thanks for the list
     
    speedster, Oct 2, 2007 IP
  8. sk33lz

    #8
    Thanks for the info! I have not had any issues with this yet, but I am sure this list will help keep most of those problems away. Thanks :D
     
    sk33lz, Oct 2, 2007 IP
  9. ForgottenCreature

    #9
    Couldn't a script just be made to block all proxy sites from accessing your site?
     
    ForgottenCreature, Oct 2, 2007 IP
  10. gostats

    #10
    Seems pretty easy, doesn't it?
    ...Well, only if the proxies play by the rules (which they don't).

    For example, if they pass along a client's user agent (like IE or Firefox), it is difficult to determine that the request is coming from a proxy at all. (And trying to detect it is very processing-intensive if your site is under any load.)

    From a programming standpoint, a good start is a well-maintained list of badly behaving proxies. Skim the SE listings daily, or set up a Google Alert for copies of your content, to find copies of your site.

    If anyone has any suggestions on automating this bad-proxy-discovery process, I'm all ears! ;)
     
    gostats, Oct 2, 2007 IP
  11. jinnnguyen

    #11
    Gostats, the list is much appreciated.

    Regards,
     
    jinnnguyen, Oct 3, 2007 IP
  12. Bothel_

    #12
    I'm running 2 proxies atm. Thanks for the help! :D
     
    Bothel_, Oct 3, 2007 IP
  13. gostats

    #13
    Remember: play nicely or someone may send a notice to your provider. :eek:
    Seriously though, if you want your proxies to stay legit and functioning, you should make sure you honor the noindex meta tags. Going as far as robots-blocking the proxied content entirely would be both noble and would really solidify your proxy's standing. ...Grumpy webmasters with hijacked content can cause you more trouble than just a banned proxy. (Not all webmasters have the resources to meta-block the proxied content, but they still don't want to be scraped by proxies.)
     
    gostats, Oct 3, 2007 IP
  14. trichnosis

    #14
    Most of the proxy sites are removing it.

    People should always check their websites against the proxy cheaters.
     
    trichnosis, Oct 3, 2007 IP
  15. ThreeGuineaWatch

    #15
    One thing that might help (and that doesn't require much noodle) is to ask the authors of the popular proxy scripts to include a barebones robots.txt with the script. I assume the majority of proxy owners are unknowingly aiding those with malicious intent simply by not being aware.
     
    ThreeGuineaWatch, Oct 3, 2007 IP
  16. trichnosis

    #16
    trichnosis, Oct 3, 2007 IP
  17. gostats

    #17
    Interesting solution. It would be good to run this check only once per IP address, since repeated tests on the same IP would waste time.
    -I'm not sure that checking the IP address for a responsive port 80 is always the best signal: what if the visitor runs a personal web server that is not a proxy? Or what if they are browsing from a workplace whose company website responds on their IP? (Rare, but it's a case to consider.)
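
    For reference, here is a rough sketch of that port-80 check, assuming you probe each visitor IP at most once and cache the result. The function name and the cache are placeholders of mine, and for the reasons above a positive result should be treated as a hint, not proof.

    # Rough sketch: probe a visitor IP for an open port 80, at most once per IP.
    import socket

    _port80_cache = {}   # ip -> True if something answered on port 80

    def looks_like_proxy(ip, timeout=2.0):
        """Return True if the IP accepts connections on port 80 (cached)."""
        if ip not in _port80_cache:
            try:
                with socket.create_connection((ip, 80), timeout=timeout):
                    _port80_cache[ip] = True
            except OSError:
                _port80_cache[ip] = False
        return _port80_cache[ip]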

    Update: there is a case where a proxy operator can show the noindex meta tag to normal clients but strip it when Google spiders the page. You may discover this through Google's removal tool (https://www.google.com/webmasters/tools/removals) when the proxied copy does not get removed because the noindex tag was stripped, or by viewing the proxied site through Google Translate (do a Spanish-to-English translation). In this case, be sure to set the offender's IP to 403 Forbidden and, if needed, contact the owner of the IP netblock. Basically, if the proxy owner is going out of his or her way to scrape and rank with your content, it's best to just get the proxy shut down. Remember to be polite and courteous for the best and fastest results.
     
    gostats, Oct 3, 2007 IP
  18. Bryce

    #18
    Thanks for the information, gostats. I read an article a few months ago about a guy who had one of his clients' sites deindexed and its content basically stolen by a proxy. I've posted a link to this thread on a few other forums, because webmasters should be made aware of this information.
     
    Bryce, Oct 3, 2007 IP
  19. gostats

    #19
    Yes. Another thing I should mention: it is likely that the client you mentioned did not actually have his content de-indexed. Rather, it was buried far down in the SERPs and was probably only visible after clicking "repeat the search with ..." (since that removes a filter). That is the nature of this "bug", and it can get very frustrating for the honest webmaster. However, I'm confident we can vaccinate most sites against these effects. More awareness of this problem in the web-hosting industry would help a lot too.
     
    gostats, Oct 3, 2007 IP
  20. gostats

    #20
    If you have gone through all of the above options and you still have a proxy that is ignoring your noindex tag, your 403s, and even your emails about the problem (and the NOC is asleep), it may be time to send a DMCA notice to Google; see http://www.google.com/dmca.html. The DMCA notice is a sort of last-resort option for protecting your content from the ranking thieves. If a proxy operator is careless enough to ignore your legitimate requests (both robotic and human), it's time to take further action.

    Make sure you fill out your DMCA report completely, and double-check it for any errors or omissions.
     
    gostats, Oct 4, 2007 IP