Keeping The Web Crawlers Out

Discussion in 'Apache' started by bgbevan, Apr 23, 2011.

  1. #1
    I would be considered an Apache "novice hobbyist" and a Mac semi-pro. I wanted to use OS X Web Sharing to transfer files that were too large to email. I set up all the linkage and a route into the web server through port 81. (This was an exercise all in itself.) I disabled "index.html" so users would just see the contents of the directory. I built a .htaccess file in case I needed to block any crawlers or other nefarious interlopers. I modified httpd.conf to allow separate instructions from .htaccess. I tested the .htaccess file by blocking requests from my own IP address and got a "Forbidden....." message, so I assumed it was working correctly. Everything worked fine as far as accessing and transferring the files went. In this example, I was transferring an MP4 video through the web server.

    ls -a /Library/WebServer/Documents
    .
    ..
    .DS_Store
    .htaccess
    Forget You.m4v
    PoweredByMacOSX.gif
    PoweredByMacOSXLarge.gif
    index(on hold).html
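
    For reference, the httpd.conf changes described above boil down to roughly the following directives. This is only a sketch, not the exact file; the stock OS X config may already contain some of these lines, and the paths and port are simply the ones used in this setup.

    # Listen on port 81 in addition to the default port
    Listen 81

    <Directory "/Library/WebServer/Documents">
        # Indexes: show a directory listing when no index.html is present
        Options Indexes FollowSymLinks
        # Let .htaccess files in this tree override the main config
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>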

    Then I started to get hits from a Google web crawler at address 66.249.71.xx. I couldn't figure out what "GET /manual/...." was accessing, but I put a "deny from" rule in the .htaccess file.

    /Library/WebServer/Documents/.htaccess
    order allow,deny
    # Google crawler
    deny from 66.249
    # Google crawler
    deny from 66.249.71.102
    # ??
    deny from 123.125.71.35
    allow from all

    But the hits kept coming, and I finally figured out that the crawler was following the "Apache Manual" hyperlink inside the "index(on hold).html" file.

    /var/log/apache/access_log
    66.249.71.102 - - [21/Apr/2011:19:55:33 -0700] "GET /manual/ko/mod/mod_alias.html HTTP/1.1" 200 20639
    66.249.71.102 - - [21/Apr/2011:21:12:23 -0700] "GET /manual/es/ko/howto/htaccess.html HTTP/1.1" 301 261
    66.249.71.102 - - [21/Apr/2011:21:12:23 -0700] "GET /manual/ko/howto/htaccess.html HTTP/1.1" 200 18054
    66.249.71.102 - - [21/Apr/2011:22:32:00 -0700] "GET /manual/es/ko/mod/mod_cache.html HTTP/1.1" 301 260
    66.249.71.102 - - [21/Apr/2011:22:32:01 -0700] "GET /manual/ko/mod/mod_cache.html HTTP/1.1" 200 22883

    I thought .htaccess would protect all the downstream access. What mechanism is at work here? Does the "/" in the path "/manual..." mean from the current location downwards? Does this mean you can't leave hyperlinks lying around in your web server files and folder(s)?
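
    (For reference, one quick way to see how the server is mapping /manual, and whether it even falls under /Library/WebServer/Documents, is to search the Apache configuration for it; on OS X the configuration normally lives under /etc/apache2.)

    grep -ri "manual" /etc/apache2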

    Bill Bevan
     
    Last edited: Apr 24, 2011
    bgbevan, Apr 23, 2011 IP
  2. MartinPrestovic

    #2
    The .htaccess file will only block robots in its current path and below. So in your case:

    /Library/WebServer/Documents/

    Would be blocked and anything below that, so if you set up:

    /Library/WebServer/Documents/Test/

    That would also be blocked.
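
    In other words, using the files from your listing (plus a hypothetical file in that Test/ subdirectory) as an example:

    /Library/WebServer/Documents/.htaccess          <- rules defined here...
    /Library/WebServer/Documents/Forget You.m4v     <- ...apply to this file
    /Library/WebServer/Documents/Test/whatever.mp4  <- ...and to anything in subdirectories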

    The problem is that Google probably indexed the contents of that directory before you placed the deny rules in the .htaccess. So, as far as Google is concerned, it already knows the links that were contained in that directory, and it will keep following them for a period of time until its system realizes that the "link" no longer exists, because it no longer has access to index(on hold).html due to being blocked.

    If that makes sense :)

    It is never a good idea to leave old files with links hanging around. If they are not used and you have no intention of using them then remove them from the server.

    Blocking robots the way you have chosen is always going to be a tail-chasing game. There are so many robots out there, and new ones come along all the time. Rather than trying to block each and every one of them, I would do things differently (it depends on your circumstances as to whether these will work for you or not).

    1) Place an index.html file in the root of that folder with a meta redirect to the root of the site (see the sketch after this list). This will prevent any robots from seeing the contents of the directory, but on the downside it will also prevent the directory listing for your users.

    2) Add a .htaccess / .htpasswd configuration on that directory (again, see the sketch below). Only give the password to people you want to access that folder. Neither Google nor any other robot will be able to access the directory, and all of their indexing attempts will result in "not authorized" errors.
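
    Rough sketches of both options follow. These are only examples: the realm name, file names, and .htpasswd location are placeholders, and the paths assume the same /Library/WebServer/Documents directory as above.

    For option 1, a minimal index.html that bounces anything landing in the directory back to the site root:

    <!-- index.html: redirect to the site root so the listing is never shown -->
    <html>
    <head>
    <meta http-equiv="refresh" content="0; url=/">
    </head>
    <body></body>
    </html>

    For option 2, a .htaccess using basic authentication (the password file is created with, e.g., htpasswd -c /Library/WebServer/.htpasswd someuser):

    # .htaccess in /Library/WebServer/Documents
    # The AuthUserFile should live outside the shared directory
    AuthType Basic
    AuthName "File drop"
    AuthUserFile /Library/WebServer/.htpasswd
    Require valid-user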

    HTH
     
    MartinPrestovic, Apr 27, 2011 IP
  3. Tritontrax

    #3
    Since it sounds like you don't really want your site to be found by people who aren't invited, you might want to just block all crawlers from everywhere with the following robots.txt in your site root.

    User-agent: *
    Disallow: /

    Any well-behaved robot that respects robots.txt (the major search engine crawlers all fall into this category) will stop indexing your site going forward.
     
    Tritontrax, Apr 28, 2011 IP
  4. bgbevan

    #4
    Many thanks to MartinPrestovic and Tritontrax for their suggestions. Both are good ways to solve this problem. I implemented passwords on a previous project, so I am familiar with that technique, and I will also try the solution proposed by Tritontrax. The one burning question I still have is that even with the .htaccess "protection", the web crawlers still seemed to be able to access the link with the "GET...." request. If you look at the original access_log, the entry:

    66.249.71.102 - - [21/Apr/2011:19:55:33 -0700] "GET /manual/ko/mod/mod_alias.html HTTP/1.1" 200 20639

    the code 200 indicates a successful request, and the 20639 indicates the byte count of the response. When I first experimented by blocking access from my own IP address, I received a "403 Forbidden" in the browser page, and the access_log showed the 403 code. I am not trying to beat this to death, but it appears to me that somehow the web crawler can still gain access to information through a link on/in the page, even though it may not be able to access the page itself. Does this make any sense?
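
    (For what it's worth, one way to check whether the .htaccess rules reach /manual at all would be to temporarily add a "deny from 127.0.0.1" to the file and then request both kinds of URL from the same machine; the port and file name below just follow the earlier examples.)

    # A file under /Library/WebServer/Documents should now come back 403 Forbidden...
    curl -I http://127.0.0.1:81/Forget%20You.m4v

    # ...while the status for a manual page shows whether the same rules apply there.
    curl -I http://127.0.0.1:81/manual/ko/howto/htaccess.html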
     
    bgbevan, May 2, 2011 IP