
Blocking bad bots with .htaccess - what is the right syntax?

Discussion in 'Apache' started by Jeffr2014, Apr 23, 2015.

  1. #1
    Hello,
    I am having a problem with blocking bots using .htaccess. I think I have tried every possible syntax variant, yet all the bots that I am blocking get an HTTP 200 response instead of 403 (I can verify it using the access log).

    I am using Apache 2.4 running on Ubuntu 14.04.2 with Plesk 12.0.18.

    My AllowOverride is set to allow the use of .htaccess files, so the .htaccess file gets loaded: when I make an error in the .htaccess syntax, I can see the error in the error log and the webpages don't load. Besides, I have some "Deny from [IP address]" directives in the .htaccess and I see that these IPs get an HTTP 403 response when they access my site.

    I spent hours trying different variants of .htaccess syntax (see below) and none of them seems to work... Any help will be greatly appreciated.

    variant 0:

    SetEnvIfNoCase User-Agent LivelapBot bad_bot
    SetEnvIfNoCase User-Agent TurnitinBot bad_bot
    Order allow,deny
    Allow from all
    Deny from env=bad_bot

    variant 1:

    SetEnvIfNoCase User-Agent .*LivelapBot.* bad_bot
    SetEnvIfNoCase User-Agent .*TurnitinBot.* bad_bot
    Order allow,deny
    Allow from all
    Deny from env=bad_bot

    variant 2:

    SetEnvIfNoCase User-Agent ^LivelapBot.* bad_bot
    SetEnvIfNoCase User-Agent ^TurnitinBot.* bad_bot
    Order allow,deny
    Allow from all
    Deny from env=bad_bot

    variant 3:
    <IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent "^LivelapBot.*" bad_bot
    SetEnvIfNoCase User-Agent "^TurnitinBot.*" bad_bot
    <Limit GET POST HEAD PUT>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>
    </IfModule>

    variant 4:

    SetEnvIfNoCase User-Agent ^LivelapBot.* bad_bot
    SetEnvIfNoCase User-Agent ^TurnitinBot.* bad_bot
    <Directory "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Directory>

    variant 5:

    RewriteEngine On
    RewriteCond %{REQUEST_URI} !/robots.txt$
    RewriteCond %{HTTP_USER_AGENT} ^LivelapBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [NC]
    RewriteRule ^.*.* [L]

    variant 6:

    RewriteEngine On
    RewriteCond %{REQUEST_URI} !/robots.txt$
    RewriteCond %{HTTP_USER_AGENT} ^\*LivelapBot$ [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^\*TurnitinBot$ [NC]
    RewriteRule ^.* - [F,L]

    variant 7:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^.*LivelapBot* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^.*TurnitinBot* [NC]
    RewriteRule ^.* - [F,L]
     
    Jeffr2014, Apr 23, 2015 IP
  2. billzo

    billzo Well-Known Member

    #2
    Another quick way to test is to deny from all.

    As you know, there are several ways of doing it, including invoking the rewrite engine. If you go the rewrite route, you have to get the OR and NC flags right.
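
    For reference, a mod_rewrite version that should work would look something like the sketch below. Note that every condition except the last needs the [OR] flag, the pattern has no leading ^ anchor (so the bot name can appear anywhere in the UA string), and the rule uses - as the substitution with [F] to return 403:

    RewriteEngine On
    # match the bot name anywhere in the User-Agent, case-insensitively
    RewriteCond %{HTTP_USER_AGENT} LivelapBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} TurnitinBot [NC]
    # "-" means no substitution; [F] returns 403 Forbidden
    RewriteRule ^ - [F]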

    This is what I do and have been using for years. It works.

    
    SetEnvIfNoCase User-Agent "^magpie-crawler" bad_bot
    #SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
    #SetEnvIfNoCase User-Agent "^baidu" bad_bot
    #SetEnvIfNoCase User-Agent "^baidu.*" bad_bot
    #SetEnvIfNoCase User-Agent "^Baiduspider/2.0" bad_bot
    #SetEnvIfNoCase User-Agent "^Yandex.*" bad_bot
    #SetEnvIfNoCase User-Agent "^YandexBot" bad_bot
    
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    
    I commented out a few with the # sign to make them inactive. I also went overboard with the variations. The only bot I was blocking on that now non-existent website was the magpie one.

    To test, just put the user agent of your web browser in there like this (Firefox in my case):

    
    SetEnvIfNoCase User-Agent firefox bad_bot
    
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    
    The quotation marks and opening caret "^" may not be necessary. :)

    More reading:

    http://httpd.apache.org/docs/2.4/rewrite/access.html#blocking-of-robots
    http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html
     
    billzo, Apr 23, 2015 IP
  3. Jeffr2014

    Jeffr2014 Active Member

    #3
    I did more testing and it works... but only in the root directory, where I have static PHP files. Blocking doesn't work in the blog folder, where I have a WordPress blog with a separate .htaccess file. It is very strange, as I was under the impression that the .htaccess file in the root folder takes precedence and the second .htaccess (in the subfolder) wouldn't even get loaded if access were blocked by the root one. Apparently, that is not the case... I hate the idea of having to insert blocking statements in both .htaccess files. Any thoughts?
     
    Last edited: Apr 24, 2015
    Jeffr2014, Apr 24, 2015 IP
  4. Jeffr2014

    Jeffr2014 Active Member

    #4
    Quick update: after spending more time on this, I was able to get it working with the following configuration:
    1. In my vhost.conf I have the following:
    ...
    SetEnvIfNoCase User-Agent LivelapBot bad_bot
    SetEnvIfNoCase User-Agent TurnitinBot bad_bot
    ...
    <Directory /var/www/>
    Order allow,deny
    Allow from all
    Deny from env=bad_bot
    </Directory>

    2. In both .htaccess files (root and blog) I had to add only these 3 lines to make it work:
    Order allow,deny
    Allow from all
    Deny from env=bad_bot

    It seems like a bit of a kludge (re #2), but the performance overhead is minimal and it works :)
     
    Last edited: Apr 24, 2015
    Jeffr2014, Apr 24, 2015 IP
  5. Jeffr2014

    Jeffr2014 Active Member

    #5
    I wanted to post an update here in case somebody has the same problem and stumbles upon this thread. I was able to get rid of the code fragments in both .htaccess files after I changed the directives in vhost.conf from the pre-Apache 2.4 style to the new style:
    ...
    <Directory /var/www/vhosts/example.com/httpdocs/>
    <RequireAll>
    Require all granted
    Require not env bad_bot
    </RequireAll>
    </Directory>

    Note that I also changed the directory path here; otherwise I was blocking access too high up and therefore blocking my custom 403 Forbidden error page as well, so offenders couldn't see my nice and friendly message to them :)

    To be honest, I am quite surprised by this: I had mod_access_compat enabled, so I expected the old style to work without the .htaccess kludge... oh well.
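
    For anyone assembling this later: combining the SetEnvIfNoCase lines from post #4 with the 2.4-style Require block above, the complete vhost.conf fragment would presumably look something like this (the paths and bot names are the ones from this thread and will differ on other setups):

    SetEnvIfNoCase User-Agent LivelapBot bad_bot
    SetEnvIfNoCase User-Agent TurnitinBot bad_bot

    <Directory /var/www/vhosts/example.com/httpdocs/>
    <RequireAll>
    # allow everyone except requests flagged as bad_bot
    Require all granted
    Require not env bad_bot
    </RequireAll>
    </Directory>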
     
    Jeffr2014, Apr 26, 2015 IP
  6. TheSHosting

    TheSHosting Member

    #6
    Blocking rogue bots via .htaccess will not solve the problem by itself if your goal is to stop resource abuse. You also need to set up custom 403/404 pages that are cheap to serve; otherwise a CMS like WordPress will process the full index page for every 403/404 response, and blocking bots via .htaccess won't reduce the load much. In that case, blocking with mod_security will help.
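
    A minimal ModSecurity rule for this would be something like the sketch below (assuming ModSecurity 2.x; the rule id is arbitrary and the bot names are the ones from this thread). The @pm operator does case-insensitive parallel matching against a list of phrases:

    SecRule REQUEST_HEADERS:User-Agent "@pm LivelapBot TurnitinBot" \
        "id:1000001,phase:1,deny,status:403,msg:'bad bot blocked'"

    Running in phase:1 (request headers) means the request is rejected before the application ever runs, which is the resource-saving point being made here.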
     
    TheSHosting, May 5, 2015 IP
  7. usasportstraining

    usasportstraining Notable Member

    #7
    I suspect the bad bots may not adhere to 'noindex', robots.txt, or any other accepted ways of controlling bots.

    If so, you could create a trap, such as a page that the good bots are instructed not to crawl. It would not be in your sitemap either. Anything that crawls it would then have its IP banned. I've seen it done with Wordfence. It may be too complex for .htaccess adjustments. Just a thought.
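
    A rough .htaccess sketch of such a trap (the /bot-trap/ path is hypothetical): disallow a directory in robots.txt, then flag anything that requests it anyway:

    # in robots.txt:
    #   User-agent: *
    #   Disallow: /bot-trap/

    # in .htaccess: anything fetching the trap ignored robots.txt
    SetEnvIf Request_URI ^/bot-trap/ bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

    Note this only denies the trap request itself; a persistent ban of the offending IP (what Wordfence does) would additionally require logging the IP and feeding it back into a deny list.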
     
    usasportstraining, May 5, 2015 IP
  8. Jeffr2014

    Jeffr2014 Active Member

    #8
    That seems too complicated... and my current setup (see my message from April 26) works just fine. I have a script that parses my access logs on a weekly basis and adds bots to the block list based on the results... so far only a few that aren't blocked already.
     
    Jeffr2014, May 6, 2015 IP