Hello, I am having a problem with blocking bots using .htaccess. I think I have tried every possible syntax variant, yet all the bots I am blocking get an HTTP 200 response instead of 403 (I can verify it using the access log). I am using Apache 2.4 running on Ubuntu 14.04.2 with Plesk 12.0.18. My AllowOverride is set to allow the use of .htaccess files, so the .htaccess file does get loaded: when I make an error in the .htaccess syntax I can see the error in the error log and the webpages don't load. Besides, I have some "Deny from [IP address]" directives in the .htaccess and I can see those IPs getting an HTTP 403 response when they access my site. I have spent hours trying different variants of .htaccess syntax (see below) and none of them seems to work... Any help will be greatly appreciated.

variant 0:

SetEnvIfNoCase User-Agent LivelapBot bad_bot
SetEnvIfNoCase User-Agent TurnitinBot bad_bot
Order allow,deny
Allow from all
Deny from env=bad_bot

variant 1:

SetEnvIfNoCase User-Agent .*LivelapBot.* bad_bot
SetEnvIfNoCase User-Agent .*TurnitinBot.* bad_bot
Order allow,deny
Allow from all
Deny from env=bad_bot

variant 2:

SetEnvIfNoCase User-Agent ^LivelapBot.* bad_bot
SetEnvIfNoCase User-Agent ^TurnitinBot.* bad_bot
Order allow,deny
Allow from all
Deny from env=bad_bot

variant 3:

<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent "^LivelapBot.*" bad_bot
SetEnvIfNoCase User-Agent "^TurnitinBot.*" bad_bot
<Limit GET POST HEAD PUT>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
</IfModule>

variant 4:

SetEnvIfNoCase User-Agent ^LivelapBot.* bad_bot
SetEnvIfNoCase User-Agent ^TurnitinBot.* bad_bot
<Directory "/">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Directory>

variant 5:

RewriteEngine On
RewriteCond %{REQUEST_URI} !/robots.txt$
RewriteCond %{HTTP_USER_AGENT} ^LivelapBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [NC]
RewriteRule ^.*.* [L]

variant 6:

RewriteEngine On
RewriteCond %{REQUEST_URI} !/robots.txt$
RewriteCond %{HTTP_USER_AGENT} ^\*LivelapBot$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^\*TurnitinBot$ [NC]
RewriteRule ^.* - [F,L]

variant 7:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*LivelapBot* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*TurnitinBot* [NC]
RewriteRule ^.* - [F,L]
Another quick way to test is to deny from all. As you know, there are several ways of doing it, including invoking the rewrite engine. If you do the rewrite, you have to worry about the OR and NC flags. This is what I do and have been using for years. It works.

SetEnvIfNoCase User-Agent "^magpie-crawler" bad_bot
#SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
#SetEnvIfNoCase User-Agent "^baidu" bad_bot
#SetEnvIfNoCase User-Agent "^baidu.*" bad_bot
#SetEnvIfNoCase User-Agent "^Baiduspider/2.0" bad_bot
#SetEnvIfNoCase User-Agent "^Yandex.*" bad_bot
#SetEnvIfNoCase User-Agent "^YandexBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

I commented out a few to make them inactive (the # sign). I also went overboard with the variations. The only bot I was blocking on that now non-existent website was the magpie one. To test, just put the user agent of your web browser in there, like this (Firefox in my case):

SetEnvIfNoCase User-Agent firefox bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

The quotation marks and opening caret "^" may not be necessary. More reading:

http://httpd.apache.org/docs/2.4/rewrite/access.html#blocking-of-robots
http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html
I did more testing and it works... but only in the root directory, where I have static PHP files. Blocking doesn't work in the blog folder, where I have a WordPress blog with its own separate .htaccess file. It is very strange, as I was under the impression that the .htaccess file in the root folder takes precedence and the second .htaccess (in the subfolder) wouldn't even get loaded if access is blocked by the root one. Apparently, that is not the case... I hate the idea of having to insert the blocking statements in both .htaccess files. Any thoughts?
Quick update: after spending more time on this I was able to get it working with the following configuration:

1. In my vhost.conf I have the following:

...
SetEnvIfNoCase User-Agent LivelapBot bad_bot
SetEnvIfNoCase User-Agent TurnitinBot bad_bot
...
<Directory /var/www/>
Order allow,deny
Allow from all
Deny from env=bad_bot
</Directory>

2. In both .htaccess files (root and blog) I had to add only these 3 lines to make it work:

Order allow,deny
Allow from all
Deny from env=bad_bot

It seems like a bit of a kludge (re #2), but the performance overhead is minimal and it works.
I wanted to post an update here in case somebody has the same problem and stumbles upon this thread. I was able to get rid of the code fragments in both .htaccess files after I changed the directives in vhost.conf from the pre-Apache 2.4 style to the new style:

...
<Directory /var/www/vhosts/example.com/httpdocs/>
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>
</Directory>

Note that I also changed the directory path here; otherwise I was blocking access way too high up and therefore blocking my custom 403 Forbidden error file as well, so offenders couldn't see my nice and friendly message to them.

To be honest, I am quite surprised by this: I had mod_access_compat enabled, so I expected the old style to work without the .htaccess kludge... oh, well.
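For completeness, the whole working block in vhost.conf ends up looking roughly like this (the SetEnvIfNoCase lines are the same ones from my earlier post; add one per bot you want to block):

# mark unwanted crawlers by matching the User-Agent header (case-insensitive)
SetEnvIfNoCase User-Agent LivelapBot bad_bot
SetEnvIfNoCase User-Agent TurnitinBot bad_bot

# Apache 2.4 style: allow everyone except requests carrying the bad_bot flag
<Directory /var/www/vhosts/example.com/httpdocs/>
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Directory>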
Blocking rogue bots via .htaccess alone will not solve the problem if your goal is to stop resource abuse. You also need custom 403/404 pages that are cheap to serve; otherwise sites like WordPress will process the full index page just to render the 403/404 response, and blocking bots via .htaccess won't help much. In that case, blocking with mod_security will help.
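Something along these lines would do it, as a rough sketch: it rejects the request in phase 1, before WordPress is ever invoked. This assumes mod_security2 is loaded and goes in the vhost/server config (not .htaccess); the rule ID and the 403 page path are just placeholders.

SecRuleEngine On

# drop requests whose User-Agent matches a known bad bot, before the body is read
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)(livelapbot|turnitinbot)" \
    "id:1000001,phase:1,deny,status:403,log,msg:'Blocked bad bot by User-Agent'"

# point the 403 at a small static page so serving the error itself stays cheap
ErrorDocument 403 /errors/403.html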
I suspect the bad bots may not adhere to 'noindex', robots.txt or any other accepted way of controlling bots. If so, you could create a trap, such as a page that the good bots are instructed not to crawl. It would not be in your sitemap either. Anything that crawls it would then have its IP banned. I've seen it done with Wordfence. It may be too complex for .htaccess adjustments alone. Just a thought.
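A rough idea of the simplest version, if you wanted to roll it yourself instead of using a plugin (the /bot-trap/ path and log file name are made up for the example): disallow the trap in robots.txt, then write every hit on it to its own log, which gives you a clean list of offending IPs to ban.

In robots.txt:

User-agent: *
Disallow: /bot-trap/

In the vhost config:

# flag any request that touches the trap directory
SetEnvIf Request_URI "^/bot-trap/" bot_trap_hit
# log only those flagged requests to a separate file for easy parsing
CustomLog /var/log/apache2/bot-trap.log combined env=bot_trap_hit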
That seems too complicated... and my current setup (see my message from April 26) works just fine. I have a script that parses my access logs on a weekly basis, and I add bots to block based on the results... so far only a few that weren't blocked already.