I intend this guide to help people who've had their site hijacked from the Google SERPs. (Many people just assume that their site has a penalty and cannot figure out why.)

A quick primer on the different types of proxies out there:
-Good type: robots.txt disallows search engines from crawling any proxied content.
-Semi-bad type: No robots.txt, but doesn't strip "no robots" meta tags or cache content.
-Bad type: No robots.txt, strips "no robots" meta tags and caches content.

The bad and semi-bad proxies are harmful to your Google rankings if they succeed in tricking Google into believing that they really "own" your content. Yes, it can and does happen, so beware!

Why do people do this? Most proxies are set up for the money that comes from displaying ads on the proxied pages. The bad proxies that manage to steal your rankings might list for some long-tail keywords, though the traffic levels are probably low, since most of the proxied pages have lower PageRank (or none at all) than the original page. The result is a mess in the search engine results. Seriously! I can understand why I see so many people selling their black-hat proxy sites. (They probably got banned in Google after getting caught, maybe even got a shutdown notice from their ISP.) BH proxy owners, be sure to ask yourself this: "Would I be annoyed if my site was jacked from the SERPs?" The loss of rankings can be quite rough for the legitimate webmaster.

I've compiled a list of things that can be done to combat these proxies and prevent them from stealing and hijacking your content.

1). Search for copies of text from your site. Use full quotes to get listings of sites which have copied your content.
2). Check each site to see if it is really copying all of your content or just a snippet.
3). If the site is a proxy, determine its IP address. (You can do this by pinging the hostname, or even by typing www.whatismyip.com into the proxy to get the real IP behind the scraper.)
4). Compile a list of the proxies and their IP addresses. You will need this list to defend your site.
5). You will need some form of includes enabled on your site: make a script that compares the requesting IP address to your list of IP addresses and, on a match, returns the following (a sketch of such a script is posted after this list):

<meta name="robots" content="noindex,nofollow" />

6). View the copy of your site on each proxy you've identified and check that it now carries the noindex code.
7). Some proxies will play fair, and you will then be able to delist the copies via Google's removal tool: https://www.google.com/webmasters/tools/removals Be sure to select "Information or image that appears in the Google search results.", then "The site owner has removed this page/image from the web or blocked it from being indexed", then enter your URL on the last page and hit submit. Steps 5 to 7 ensure that your content is removed from Google quickly.
8). For any proxies that strip your "noindex" meta tag, add them to your 403 list.
9). When you set up your 403 list, import those IP addresses into your .htaccess file to block the content outright.
10). You may notice that some proxies revert to a cached version of your page! In this case, take down the IP addresses of these proxies so that you can follow up.
11). When it comes to dealing with the proxies themselves, you can try to email them directly, but getting a response from them is not common. It doesn't hurt to try! If all else fails, you may want to politely inform their upstream provider.
(Do a whois search on their IP address at http://arin.net to see who manages their netblock.) The abuse department is usually the best contact.
12). Be clear with your requests to any upstream provider. NOC staff are busy, heavily worked people who get too many complaints from total noobs about things that are really non-issues. Use as few words as possible, mention that the proxy site in question is honoring neither your robots tags nor your 403 Forbidden responses, and be sure to explain the difference between this proxy and a normal proxy. Always be polite... I can't stress this enough. I used to work at an ISP, and dealing with rude and belligerent requests gets really annoying. Remember folks, NOC personnel are people too. (NOC = Network Operations Center.)
13). Follow up requests for further information, or even initial denials, with a polite summary and emphasis of your problem. Some NOCs are too busy to fully read your original email, so they may overlook important details buried in the mess.

So that's a primer on some first steps for anyone who is experiencing this problem. I'll try to add more details and answer any questions people may have. I have also compiled a list of IPs to "noindex" and "403", which I will post later. I also welcome any feedback.
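To make step 5 a little more concrete, here is a minimal sketch in Python (standard library only). It assumes you keep your list in a plain-text file, one entry per line in the same "noindex <ip>" / "deny <ip>" format as the list I post below. The file name proxy_ips.txt is just a placeholder, and you would wire the output into whatever include or templating mechanism your site actually uses:

import os

def load_proxy_list(path="proxy_ips.txt"):
    # Parse lines like "noindex 72.232.177.43" or "deny 77.232.66.109".
    actions = {}
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[0] in ("noindex", "deny"):
                    actions[parts[1]] = parts[0]
    except IOError:
        pass  # no list yet, so treat every visitor as a normal client
    return actions

def robots_meta_for(remote_ip, actions):
    # Extra <head> markup to inject for this request (step 5), if any.
    if actions.get(remote_ip) == "noindex":
        return '<meta name="robots" content="noindex,nofollow" />'
    return ""

def should_block(remote_ip, actions):
    # True if this IP belongs on the 403 list (step 8).
    return actions.get(remote_ip) == "deny"

if __name__ == "__main__":
    # In a CGI-style environment the requesting IP usually arrives in REMOTE_ADDR.
    actions = load_proxy_list()
    ip = os.environ.get("REMOTE_ADDR", "")
    if should_block(ip, actions):
        print("Status: 403 Forbidden")
        print()
    else:
        print(robots_meta_for(ip, actions))

The point is simply that the check is one dictionary lookup per request, so it adds almost no overhead once the list is loaded.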
Here's the list so far.

*I have processed these by hand.
*Two or three of these have already been suspended by their ISP.
*IP address delegations can change. Keep a date associated with your IP address list and review it occasionally for freshness.
*Do an ARIN/RIPE/APNIC lookup on each IP to ensure that you are not blocking Yahoo or Google etc. (unlikely, since I looked these up).
*Sorry, I haven't been storing the proxy/scraper site along with the IPs. Try doing a reverse lookup, or just surf to the IP.
*Feel free to post your own here, but keep in mind: a blacklist is only good as long as it is fresh and the false positives are kept to a minimum.

(Sorry for the bad formatting, I'm just copying this output from my server.)

noindex: generally the scraper responds to insertion of the meta-noindex tag. However, there are a few "noindex" records that I have missed putting a deny tag on.
deny: just 403 these proxies, they are evil and ignore all tags, no-cache, etc.
Tip: you may want to send garbage bits or garbage text to the deny records instead of a 403. (If they get a 403, they will just serve a cached copy of your page; by sending garbage, they will index and proxy garbage to the search engines.) Do this if the site already has a cache of your site.

noindex 72.232.177.43
noindex 65.98.59.210
noindex 70.84.106.153
noindex 64.209.134.9
noindex 77.232.66.139
deny 77.232.66.109
noindex 66.246.246.50
deny 66.90.104.77
noindex 85.12.25.70
deny 65.98.59.218
noindex 64.85.160.59
noindex 66.90.77.2
noindex 74.86.43.151
deny 208.100.20.148
noindex 66.96.95.200
deny 208.109.78.122
deny 72.249.119.71
deny 202.131.89.100
deny 217.16.16.204
deny 217.16.16.215
deny 83.222.23.243
deny 76.163.209.36
deny 217.16.16.218
deny 83.222.23.200
deny 83.222.23.214
deny 83.222.23.202
deny 66.232.126.136
deny 82.146.62.107
deny 66.232.126.134
noindex 70.84.106.146
deny 78.129.131.123
deny 70.87.207.234
deny 76.162.253.46
deny 64.72.126.8
noindex 216.86.152.216
deny 85.17.58.183
deny 216.129.107.106
deny 72.36.145.138
deny 64.34.166.94
noindex 70.183.59.7
noindex 66.90.73.236
noindex 69.59.22.27
deny 74.86.61.234
deny 69.59.28.145
noindex 193.138.206.207
noindex 205.209.146.144
noindex 203.146.251.107
noindex 67.159.44.136
noindex 85.92.130.117
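For step 9, if you keep the list in that same file format, you can generate the .htaccess block from the "deny" entries rather than typing them by hand. A small sketch, again in Python with the placeholder file name proxy_ips.txt, emitting the classic Apache 2.2 access-control syntax (adjust if your server is configured differently):

def htaccess_deny_block(list_path="proxy_ips.txt"):
    # Collect the IPs marked "deny" and wrap them in an Apache 2.2 block.
    ips = []
    with open(list_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "deny":
                ips.append(parts[1])
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += ["Deny from %s" % ip for ip in ips]
    return "\n".join(lines)

if __name__ == "__main__":
    print(htaccess_deny_block())

Run it whenever the list changes and paste the output into your .htaccess.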
conradbarrett, I've sent a PM to you with my MSN... However, just a note: I'm rarely on MSN. Please just email or PM me.
By the way, conradbarrett, I've noticed that the site in your signature has some content that seems to be the same as the description of the Heroes show on Wikipedia. It might be best to write a unique description in your own words. (Copying content snippets is also a way to devalue your site.) As a rule of thumb, never copy content unless you are given express permission to do so, and then follow the terms that come with it. Just a reminder: if you ever feel that you have a duplicate content penalty, first check that you have not copied any snippets from around the web. If another site says it better than you can, just link to them.
Thanks for the info! I have not had any issues with this yet, but I am sure this list will help keep most of those problems away. Thanks
Seems pretty easy, doesn't it? ...Well, only if the proxies play by the rules (which they don't). For example, if they pass on the client's user agent (like IE or Firefox), it is difficult to determine that the request is coming from a proxy, and trying to detect it on the fly is very process-intensive if your site is under any load. From a programmer's standpoint, a good start is a well-maintained list of badly behaving proxies. Skim the SE listings daily, or put a Google Alert on phrases from your content, to find copies of your site. If anyone has any suggestions on automating this bad-proxy-discovery process, I'm all ears! (A rough half-automated sketch is below.)
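I don't have a fully automated answer either, but the checking half of it (steps 2 to 4 in the first post) is easy to script. A rough sketch in Python: feed it the suspect URLs you collected by hand from quoted-phrase searches, and it tests each one for a sentence that only exists on your site and resolves the host's IP for your list. The sentinel phrase and the example URLs below are placeholders:

import socket
import urllib.request
from urllib.parse import urlparse

# Placeholders: a unique sentence from one of your own pages, plus suspect URLs
# collected from quoted-phrase searches.
SENTINEL = "a distinctive sentence copied verbatim from your own page"
SUSPECTS = [
    "http://example-proxy-one.invalid/index.php?q=http://www.yoursite.com/",
    "http://example-proxy-two.invalid/browse.php?u=http://www.yoursite.com/",
]

def check_suspect(url):
    # Return (is_copy, ip) for one suspect URL.
    host = urlparse(url).hostname
    try:
        ip = socket.gethostbyname(host)
    except socket.error:
        ip = "unresolved"
    try:
        page = urllib.request.urlopen(url, timeout=15).read().decode("utf-8", "replace")
    except Exception:
        return (False, ip)
    return (SENTINEL.lower() in page.lower(), ip)

if __name__ == "__main__":
    for url in SUSPECTS:
        is_copy, ip = check_suspect(url)
        if is_copy:
            print("COPY FOUND: %s (IP %s)" % (url, ip))

It won't discover new proxies on its own, but it does turn "check each site and note its IP" into a one-command job once you have candidates.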
Remember: play nicely or someone may send a notice to your provider. Seriously though, if you want your proxies to stay legitimate and functioning, you should honor the no-robots meta tags. Going as far as robots-blocking the proxied content yourself would be both noble and would really solidify your proxy. Grumpy webmasters with hijacked content can cause more trouble than just a banned proxy. (Not all webmasters have the resources to meta-block the proxied content, but they still don't want to be scraped by proxies.)
Most of the proxy sites are removing it. People must always check their websites against the proxy cheaters.
One thing that might help (and doesn't require much noodle) is to ask the authors of the popular proxy scripts to include a barebones robots.txt with the script (example below). I assume the majority of proxy owners are unknowingly aiding those with malicious intent simply by not being aware.
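For reference, the barebones robots.txt in question only needs two lines to tell well-behaved spiders to stay out of everything the proxy serves:

User-agent: *
Disallow: /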
I totally agree with you. http://forums.digitalpoint.com/showthread.php?t=500687 contains my solution for the proxy cheater.
Interesting solution. It would be good to run this only once per IP address, since multiple tests on the same IP address would waste time. I'm not sure that checking the IP address for a responsive port 80 is always the best test: what if the visitor runs a personal webserver that is not a proxy, or is browsing from a workplace whose company website responds on their IP address? (Rare, but it's a case to consider.)

Update: There is a case where a proxy operator shows the no-robots meta tag to normal clients but strips it when Google spiders the page. You may discover, using Google's removal tool ( https://www.google.com/webmasters/tools/removals ), that their site does not get removed because they stripped the noindex tag. You can also check by viewing the site through Google Translate (do a Spanish-to-English translation). In this case, be sure to set the offender's IP to 403 Forbidden and, if needed, contact the IP netblock owner. Basically, if the proxy owner is going out of his/her way to scrape and rank with your content, it's best to just get the proxy shut down. Remember to be polite and courteous for the best and fastest results. (A quick user-agent test is sketched below.)
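If you want to test for that kind of stripping yourself, a quick (and admittedly imperfect) check is to fetch the proxied copy of one of your pages with a normal browser User-Agent and again with a Googlebot-style one, then compare whether your injected noindex tag survives. A sketch in Python; the proxied URL is a placeholder, and keep in mind a proxy doing real cloaking may key off the crawler's IP range rather than the User-Agent, so a clean result here is only a hint:

import urllib.request

# Placeholder: the URL of your page as served through the suspect proxy.
PROXIED_URL = "http://suspect-proxy.invalid/browse.php?u=http://www.yoursite.com/"
NOINDEX = 'content="noindex'
AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for name, ua in AGENTS.items():
    req = urllib.request.Request(PROXIED_URL, headers={"User-Agent": ua})
    try:
        body = urllib.request.urlopen(req, timeout=15).read().decode("utf-8", "replace")
    except Exception as exc:
        print("%s fetch failed: %s" % (name, exc))
        continue
    kept = NOINDEX in body.lower()
    print("%s UA: noindex tag %s" % (name, "kept" if kept else "STRIPPED"))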
Thanks for the information, gostats. I read an article a few months ago about a guy who had one of his clients' sites deindexed and its content basically stolen by a proxy. I posted a link to this thread on a few other forums because webmasters should be made aware of this information.
Yes, another thing I should mention: it is likely that the client you mentioned did not actually have his content de-indexed. Rather, it was buried far down in the SERPs and was likely only visible after clicking "repeat the search with ....", since that removes a filter. That is the nature of this "bug", and it can get very frustrating for the honest webmaster. However, I'm confident we can vaccinate most sites against these effects. More awareness of this problem in the web-hosting industry would help a lot too.
If you have gone through all of the above options and you still have a proxy that is ignoring your noindex tag, your 403s and even your emails about the problem (and the NOC is asleep), it may be time to send a DMCA notice to Google; see http://www.google.com/dmca.html. The DMCA notice is a sort of last-resort option for protecting your content from the ranking thieves. If a proxy operator is careless enough to ignore your legitimate requests (both robotic and human), it is time to take further action. Make sure you fill out your DMCA report fully and double-check it for any errors or omissions.