
The Ultimate Blocking Bots Thread - Spyder Spanker/.htaccess/robots.txt for 301 Redirects and PBN Sites

Discussion in 'Search Engine Optimization' started by Kyle17, Jun 18, 2014.

  1. #1
    WE NEED TO BLOCK ALL BAD BOTS FROM CRAWLING OUR SITES.
    (SOME RESPONSES FROM ANOTHER THREAD MADE IN PRIVATE FORUM HAVE BEEN POSTED HERE JUST TO GET THREAD GOING)
    The problem is, not all methods work all the time, and sometimes you get a few links "leaking" for whatever reason.
    Or even worse, all your links show up in a certain tool while maybe not in others.
    However, this should be of the utmost priority for you if you are managing an expensive blog network.
    I would like to make this thread the place for us to compare information regarding what is working and what simply doesn't.
    You can read some info here on what I have tried with SpyderSpanker, a custom IP list and issues with that (link not working here)
    Now I am going to start using redirects for various reasons to rank some sites.
    And I need to hide these redirects from the link tools as well.
    I will be testing out different code and reporting my findings.
    A good thread I found on this is at http://www.blackhatunderground.net/forum/blackhat-seo-marketing-and-traffic-generation/how-to-stop-competitors-noticing-you-use-301s-to-rank/
    I don't yet understand all the intricacies of robots.txt and .htaccess and will just be testing this stuff out on sites I am using for redirects.
    Some people suggest using this in .htaccess (of course, you should put all the main bots in: rogerbot, ahrefsbot, sitebot, mj12bot, etc.):
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !/robots.txt$
    RewriteCond %{HTTP_USER_AGENT} ^.*BLEXBot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*BlackWidow.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Nutch.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Jetbot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*WebVac.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Stanford.*$ [NC]
    RewriteRule .* - [F,L]

    In conjunction with the robots.txt:
    User-agent: BLEXBot
    User-agent: BlackWidow
    User-agent: Nutch
    User-agent: Jetbot
    User-agent: WebVac
    User-agent: Stanford
    Disallow: /
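
    A quick way to sanity-check a robots.txt block list like the one above (with the bot name spelled "Stanford") is Python's standard-library robots.txt parser. This is just a sketch for verifying the grouping of multiple User-agent lines over one Disallow rule:

```python
# Sketch: check which user-agents the robots.txt above actually blocks,
# using Python's built-in Robots Exclusion Protocol parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: BLEXBot
User-agent: BlackWidow
User-agent: Nutch
User-agent: Jetbot
User-agent: WebVac
User-agent: Stanford
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Listed bots are denied everywhere...
print(parser.can_fetch("BLEXBot", "/some-page.html"))    # False
# ...while anything not listed (e.g. Googlebot) is unaffected.
print(parser.can_fetch("Googlebot", "/some-page.html"))  # True
```

    Consecutive User-agent lines form one group, so all six bots share the single Disallow rule; that part of the syntax is fine even when crawlers choose to ignore it.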

    While others are using:

    SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot
    SetEnvIfNoCase User-Agent .*exabot.* bad_bot
    SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
    SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
    SetEnvIfNoCase User-Agent .*gigabot.* bad_bot
    SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
    SetEnvIfNoCase User-Agent .*sitebot.* bad_bot
    <Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>
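
    For anyone unsure what those SetEnvIfNoCase lines actually match: each one does a case-insensitive regex test against the User-Agent header and sets the bad_bot variable on a hit. A minimal sketch of the same matching logic in Python:

```python
import re

# Patterns mirroring the SetEnvIfNoCase rules above: a case-insensitive
# substring match anywhere in the User-Agent header.
BAD_BOT_PATTERNS = ["rogerbot", "exabot", "mj12bot", "dotbot",
                    "gigabot", "ahrefsbot", "sitebot"]

def is_bad_bot(user_agent: str) -> bool:
    """Return True if any blocked pattern appears in the UA string."""
    return any(re.search(p, user_agent, re.IGNORECASE)
               for p in BAD_BOT_PATTERNS)

print(is_bad_bot("Mozilla/5.0 (compatible; AhrefsBot/5.0)"))  # True
print(is_bad_bot("Mozilla/5.0 (Windows NT 10.0) Firefox"))    # False
```

    Note the unanchored match: "AhrefsBot" is caught even though the real UA string starts with "Mozilla/5.0 (compatible; ...", which is why the `.*bot.*` style patterns work where anchored `^bot` patterns would not.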

    The "RewriteEngine On" code seems more likely to actually work, but the problem I've run into is that it fails on many hosts, and their support isn't equipped to explain why. If you are redirecting a site with that code in your .htaccess (rather than the "SetEnvIfNoCase" code), the site will just show an error page on some hosts instead of redirecting.
    Anyway, if anybody can contribute anything here, that would be awesome. I'm off to bang my head against the wall figuring it out in the meantime.

    ======================================================
    For me, for some reason, trying to block crawlers with the .htaccess file didn't work.
    Furthermore, I've noticed that Spyder Spanker doesn't work with some hosts, so sometimes I use a plugin called Bluagent.
    I've never tried blocking anything with robots.txt because I'm sure a lot of crawlers don't obey it anyway.
    ======================================================
    Hey Smoker, what htaccess code did you use exactly?

    SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot
    SetEnvIfNoCase User-Agent .*exabot.* bad_bot
    SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
    SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
    SetEnvIfNoCase User-Agent .*gigabot.* bad_bot
    SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
    SetEnvIfNoCase User-Agent .*sitebot.* bad_bot
    <Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>

    or something like

    RewriteEngine On
    RewriteCond %{REQUEST_URI} !/robots.txt$
    RewriteCond %{HTTP_USER_AGENT} ^.*BLEXBot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*BlackWidow.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Nutch.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Jetbot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*WebVac.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Stanford.*$ [NC]
    RewriteRule .* - [F,L]

    Spyder Spanker requires ionCube and PHP 5.4 on the server, from my understanding; that's why it sometimes fails to activate and save settings, at least. Are you saying it just didn't work even when activated and with custom bots/IPs added?

    My experience is that Spyder Spanker blocks Majestic perfectly, but not OSE and Ahrefs (at least not with the IP/bot list I have used so far).
    ======================================================
    For me the SetEnvIfNoCase method hasn't worked on any hosting account I've tried it on (no errors, it just did nothing!). Whereas the Rewrite method worked and is what I'm using right now. Still don't know why SetEnvIfNoCase didn't work as both should be fine!
    Personally I don't use the Robots files at all for this -- partly because it's up to the spider if it obeys the rule or not and partly because it's a publicly readable file – so potentially leaves a footprint (every robots.txt will be identical…).
    You can test if the blocks are working using a Firefox plugin like User Agent Overrider.
    Anyone confused by all this and worried they might break their sites should probably stick with WordPress/Spyder Spanker (AND follow the author's instructions to add the extra agents to block).
    Just as a side note for general info:
    robots.txt is a file spiders may read and may choose to obey (depending on how they have been programmed).
    .htaccess tells your web server what to do when anyone visits, so if you deny a particular user-agent outright, then anything using that user-agent cannot access your site (i.e. it's no longer up to them to obey).
    BUT, in both cases it depends 100% on the spider identifying itself in the user-agent data it sends. Most spiders behave and identify themselves properly. The only way to block the ones that don't is to block the IP addresses they use.
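
    A minimal sketch of that IP-based fallback, using Python's standard library (the CIDR ranges here are documentation examples, not real crawler IPs, which you would have to collect from your own server logs):

```python
from ipaddress import ip_address, ip_network

# Hypothetical blocklist: these are RFC 5737 documentation ranges,
# standing in for real crawler IP ranges gathered from logs.
BLOCKED_RANGES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

def is_blocked_ip(addr: str) -> bool:
    """True if the visitor's IP falls inside any blocked CIDR range."""
    ip = ip_address(addr)
    return any(ip in net for net in BLOCKED_RANGES)

print(is_blocked_ip("192.0.2.55"))   # True
print(is_blocked_ip("203.0.113.9"))  # False
```

    Unlike the user-agent checks, this works no matter what the crawler puts in its headers; the tradeoff is keeping the range list up to date.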
    ======================================================
    Great info Martin.
    I want to use the "Rewrite" method as well.
    Do you ever have problems getting the "Rewrite" method to work on a host?
    I have tried using the "Rewrite" method on 2 different hosts (to block sites that are 301'd to another site) and in both cases it results in an error page (and not a successful redirect).

    I understand the .htaccess vs. robots.txt functionality difference, but I found this interesting thread that suggests an important use for robots.txt at http://www.blackhatunderground.net/forum/blackhat-seo-marketing-and-traffic-generation/how-to-stop-competitors-noticing-you-use-301s-to-rank/ where the guy says (not sure how true it is):
    "I made a big post about this on WF a year back. There are some errors in the above code, and the logic is a little incorrect. For example, if you block the access in the .htaccess file you need to also block it in the robots.txt file – BUT here is the kicker: you still have to give the bots access to robots.txt so they know to dump your data. If you only use the above code, it will not erase your site from Archive.org, for example. Archive.org only erases stuff if it's blocked in robots.txt, but since you are doing a .htaccess deny, it never gets to it, so the best thing to do is allow the robots access only to the robots.txt file. Ahrefs, SEOMoz, and Majestic have the same policy: they will keep the data unless they see they are blocked in robots.txt.
    Make your .htaccess exclude the blocking for robots.txt like this (notice line 2):
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !/robots.txt$
    RewriteCond %{HTTP_USER_AGENT} ^.*BLEXBot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*BlackWidow.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Nutch.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Jetbot.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*WebVac.*$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*Stanford.*$ [NC]
    RewriteRule .* - [F,L]
    And Now the Robots file:
    User-agent: BLEXBot
    User-agent: BlackWidow
    User-agent: Nutch
    User-agent: Jetbot
    User-agent: WebVac
    User-agent: Stanford
    Disallow: /
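
    The key idea in the quoted rules, denying the bot everywhere except /robots.txt so it can still see that it is disallowed, can be modelled like this (a sketch of the decision logic only, assuming the RewriteRule returns 403 Forbidden, not the actual Apache behaviour):

```python
import re

BLOCKED_AGENTS = ["BLEXBot", "BlackWidow", "Nutch",
                  "Jetbot", "WebVac", "Stanford"]

def handle_request(user_agent: str, path: str) -> int:
    """Return the HTTP status the rules above would produce.

    Blocked bots get 403 on every path EXCEPT /robots.txt, so they can
    still read the Disallow rules and (ideally) drop their stored data.
    """
    blocked = any(re.search(name, user_agent, re.IGNORECASE)
                  for name in BLOCKED_AGENTS)
    if blocked and path != "/robots.txt":
        return 403   # denied outright by .htaccess
    return 200       # served normally

print(handle_request("BLEXBot/1.0", "/index.html"))   # 403
print(handle_request("BLEXBot/1.0", "/robots.txt"))   # 200
print(handle_request("Googlebot/2.1", "/index.html")) # 200
```

    That second condition in the .htaccess (`!/robots.txt$`) is what carves out the exception; without it, a blocked bot can never read the robots.txt at all.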
    ======================================================
    Hey Kyle,
    I used Better WP Security plugin and it has an option to block bots by their user-agent.
    It generated a code like this for me:
    RewriteCond %{HTTP_USER_AGENT} ^Ahrefsbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
    Again, I don't know anything about Apache, but this simply didn't work.
    ======================================================
    I haven't had problems with the Rewrite method on any host… not yet anyway :)

    Do you know what error is being returned for your 301 redirects? Another useful plugin for Firefox is 'Live Http Headers' – it lets you capture exactly what headers are being sent to more clearly see things like redirects.

    It's certainly true that blanket denying a spider access using .htaccess will stop it seeing any rules that may apply to it in robots.txt (as it literally cannot read the file and will get a 403 error returned when it tries). But what they do as a result of that denied access is up to them – archive.org do specifically say they will remove records if denied in robots.txt, but Majestic etc probably keep data once they have it for a lot longer regardless of what you do (not sure).
    ======================================================
    I would use the rewrite
    Ask your web host whether the code you're adding to .htaccess is correct, telling them what you want to do. Save that brain power for SEO.
    ======================================================
    Did you guys try the Link Privacy plugin by Jerry West? http://linkprivacy.com It's free and it works well for me. I never see my links in Majestic SEO and other similar tools. I don't know if it works well with redirects, though. It may be worth a try.
    ======================================================
    @Chris Stewart I find that most hosts don't have any knowledgeable support, as they are cheap hourly employees without any real technical knowledge.
    @palev94 I haven't tried linkprivacy. How long have you been using it? You never see the links in any OSE, Ahrefs, or majestic?

    One thing I'm now trying to figure out is how to properly set up a 301 redirect in the first place.
    I recently set up a 301 redirect directly in my GoDaddy account, and the site I redirected to showed up in the SERPs exactly in the original site's place the next day.
    Then I learned I needed to be able to access the .htaccess file, so I pointed the nameservers to a host, went into cPanel, and set up a 301 redirect there. However, the rankings dropped about 4 spots, from number 8 to number 12, immediately after changing the redirect style like that. Does anyone have any insight on this?
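
    For reference, a domain-wide 301 in .htaccess is only a couple of lines (a sketch; example.com stands in for the real destination domain):

```apache
# Sketch: redirect every URL on this site to the same path on the new
# domain, returning a 301 (permanent) status. "example.com" is a
# placeholder destination.
RewriteEngine On
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

    A cPanel or GoDaddy "redirect" option generally writes an equivalent rule for you, so the mechanism should be the same either way.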
    ======================================================
    Kyle Campbell said
    @palev94 I haven't tried linkprivacy. How long have you been using it? You never see the links in any OSE, Ahrefs, or majestic?
    I was going to say no, but I just checked before I reply and I saw a few links in majestic SEO nothing in OSE though.
    ======================================================
    Kyle Campbell said
    One thing I'm now trying to figure out is how to properly set up a 301 redirect in the first place.
    I recently set up a 301 redirect directly in my GoDaddy account, and the site I redirected to showed up in the SERPs exactly in the original site's place the next day.
    Then I learned I needed to be able to access the .htaccess file, so I pointed the nameservers to a host, went into cPanel, and set up a 301 redirect there. However, the rankings dropped about 4 spots, from number 8 to number 12, immediately after changing the redirect style like that. Does anyone have any insight on this?
    There shouldn't be any difference at all between those methods. A 301 is a 301. Unless GoDaddy do something odd, but your results suggest they probably don't (you can verify using the headers plugin I mentioned above, but test deeper URLs as well as the homepage if relevant). Google will show the destination site under a search for the 301'd site once it picks up the change. That's normal. They're trying to be helpful to the user and say "although you looked for this specifically, you're going to end up here instead".
    I think it's too soon to attribute the drop to the change, these days they seem to just randomise when cause takes effect!