Keeping The Web Crawlers Out

Discussion in 'Apache' started by bgbevan, Apr 23, 2011.

  1. #1
    I would be considered an Apache "novice hobbyist" and a Mac semi-pro. I wanted to use OS X Web Sharing to transfer files that were too large to email. I set up all the linkage and a route into the web server through port 81. (This was an exercise all in itself.) I disabled "index.html" so users would just see the contents of the directory. I built a .htaccess file in case I needed to block any crawlers or other nefarious interlopers. I modified httpd.conf to allow separate instructions from .htaccess. I tested the .htaccess file by blocking requests from my own IP address and got a "Forbidden....." message, so I assumed it was working correctly. Everything worked fine as far as accessing and transferring the files went. In this example, I was transferring an MP4 video through the web server.

    ls -a /Library/WebServer/Documents
    .
    ..
    .DS_Store
    .htaccess
    Forget You.m4v
    PoweredByMacOSX.gif
    PoweredByMacOSXLarge.gif
    index(on hold).html
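
    For reference, the httpd.conf changes described above boil down to roughly the following directives. This is only a sketch, not the exact file; the stock OS X config may already contain some of these lines, and the paths and port are simply the ones used in this setup.

    # Listen on port 81 in addition to the default port
    Listen 81

    <Directory "/Library/WebServer/Documents">
        # Indexes: show a directory listing when no index.html is present
        Options Indexes FollowSymLinks
        # Let .htaccess files in this tree override the main config
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>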

    Then I started to get hits from a Google web crawler at address 66.249.71.xx. I couldn't figure out what "GET /manual/...." was accessing, but I put a "deny from" rule in the .htaccess file.

    /Library/WebServer/Documents/.htaccess
    order allow,deny
    # Google crawler
    deny from 66.249
    # Google crawler
    deny from 66.249.71.102
    # ??
    deny from 123.125.71.35
    allow from all

    But the hits kept coming, and I finally figured out that the crawler was following the "Apache Manual" hyperlink inside the "index(on hold).html" file.

    /var/log/apache/access_log
    66.249.71.102 - - [21/Apr/2011:19:55:33 -0700] "GET /manual/ko/mod/mod_alias.html HTTP/1.1" 200 20639
    66.249.71.102 - - [21/Apr/2011:21:12:23 -0700] "GET /manual/es/ko/howto/htaccess.html HTTP/1.1" 301 261
    66.249.71.102 - - [21/Apr/2011:21:12:23 -0700] "GET /manual/ko/howto/htaccess.html HTTP/1.1" 200 18054
    66.249.71.102 - - [21/Apr/2011:22:32:00 -0700] "GET /manual/es/ko/mod/mod_cache.html HTTP/1.1" 301 260
    66.249.71.102 - - [21/Apr/2011:22:32:01 -0700] "GET /manual/ko/mod/mod_cache.html HTTP/1.1" 200 22883

    I thought .htaccess would protect all the downstream access. What mechanism is at work here? Does the "/" in the path "/manual..." mean from the current location downwards? Does this mean you can't leave hyperlinks lying around in your web server files and folder(s)?
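
    (For reference, one quick way to see how the server is mapping /manual, and whether it even falls under /Library/WebServer/Documents, is to search the Apache configuration for it; on OS X the configuration normally lives under /etc/apache2.)

    grep -ri "manual" /etc/apache2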

    Bill Bevan
     
    Last edited: Apr 24, 2011
    bgbevan, Apr 23, 2011 IP
  2. MartinPrestovic

    #2
    The .htaccess file will only block robots in its current path and below. So in your case:

    /Library/WebServer/Documents/

    Would be blocked and anything below that, so if you set up:

    /Library/WebServer/Documents/Test/

    That would also be blocked.
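
    In other words, using the files from your listing (plus a hypothetical file in that Test/ subdirectory) as an example:

    /Library/WebServer/Documents/.htaccess          <- rules defined here...
    /Library/WebServer/Documents/Forget You.m4v     <- ...apply to this file
    /Library/WebServer/Documents/Test/whatever.mp4  <- ...and to anything in subdirectories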

    The problem is that Google probably indexed the contents of that directory before you placed the deny rules in the .htaccess. So, as far as Google is concerned, it already knows the links that were contained in that directory, and it will keep following them for a period of time until its system realizes that the "link" no longer exists, because it no longer has access to index(on hold).html due to being blocked.

    If that makes sense :)

    It is never a good idea to leave old files with links hanging around. If they are not used and you have no intention of using them then remove them from the server.

    Blocking robots the way you have chosen is always going to be a tail-chasing game. There are so many robots out there, and new ones come along all the time. Rather than trying to block each and every one of them, I would do things differently (it depends on your circumstances as to whether these will work for you or not).

    1) Place an index.html file in the root of that folder with a meta redirect to the root of the site (see the sketch after this list). This will prevent any robots from seeing the contents of the directory, but on the downside it will also prevent the directory listing for your users.

    2) Add a .htaccess / .htpasswd configuration on that directory (again, see the sketch below). Only give the password to people you want to access that folder. Neither Google nor any other robot will be able to access the directory, and all of their indexing attempts will result in "not authorized" errors.
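
    Rough sketches of both options follow. These are only examples: the realm name, file names, and .htpasswd location are placeholders, and the paths assume the same /Library/WebServer/Documents directory as above.

    For option 1, a minimal index.html that bounces anything landing in the directory back to the site root:

    <!-- index.html: redirect to the site root so the listing is never shown -->
    <html>
    <head>
    <meta http-equiv="refresh" content="0; url=/">
    </head>
    <body></body>
    </html>

    For option 2, a .htaccess using basic authentication (the password file is created with, e.g., htpasswd -c /Library/WebServer/.htpasswd someuser):

    # .htaccess in /Library/WebServer/Documents
    # The AuthUserFile should live outside the shared directory
    AuthType Basic
    AuthName "File drop"
    AuthUserFile /Library/WebServer/.htpasswd
    Require valid-user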

    HTH
     
    MartinPrestovic, Apr 27, 2011 IP
  3. Tritontrax

    #3
    Since it sounds like you don't really want your site to be found by people who aren't invited, you might want to just block all crawlers from everywhere with the following robots.txt in your site root.

    User-agent: *
    Disallow: /

    Any well-behaved robot that respects robots.txt (the major search engine crawlers all fall into this category) will stop indexing your site going forward.
     
    Tritontrax, Apr 28, 2011 IP
  4. bgbevan

    #4
    Many thanks to MartinPrestovic and Tritontrax for their suggestions. Both are good ways to solve this problem. I implemented passwords on a previous project, so I am familiar with that technique, and I will also try the solution proposed by Tritontrax. The one burning question I still have is that even with the .htaccess "protection", the web crawlers still seemed to be able to access the link with the "GET...." request. If you look at the original access_log, the entry:

    66.249.71.102 - - [21/Apr/2011:19:55:33 -0700] "GET /manual/ko/mod/mod_alias.html HTTP/1.1" 200 20639

    the code 200 indicates a successful request, and the 20639 indicates the byte count of the response. When I first experimented by blocking access from my own IP address, I received a "403 Forbidden" in the browser page, and the access_log showed the 403 code. I am not trying to beat this to death, but it appears to me that somehow the web crawler can still gain access to information through a link on/in the page, even though it may not be able to access the page itself. Does this make any sense?
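
    (For what it's worth, one way to check whether the .htaccess rules reach /manual at all would be to temporarily add a "deny from 127.0.0.1" to the file and then request both kinds of URL from the same machine; the port and file name below just follow the earlier examples.)

    # A file under /Library/WebServer/Documents should now come back 403 Forbidden...
    curl -I http://127.0.0.1:81/Forget%20You.m4v

    # ...while the status for a manual page shows whether the same rules apply there.
    curl -I http://127.0.0.1:81/manual/ko/howto/htaccess.html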
     
    bgbevan, May 2, 2011 IP