Stopping/Discouraging SpamBots.

Discussion in 'robots.txt' started by Owlcroft, Jan 14, 2005.

  1. #1
    We all know the pain--and bandwidth cost, both figurative and literal--of spambots that suck up our sites, whether for scraping, email harvesting, or whatever.

    We all know also the various ways in which unwanted user-agents can, in principle, be stopped: robots.txt blocks by user-agent name, and .htaccess blocks by user-agent name or IP address. But those, while helpful, can scarcely do the whole job, inasmuch as they necessarily block particular user-agent names and particular IP addresses--and spammers can and do change those things a lot more often than they probably change their underwear. Using those tools is fighting WWII with the weapons of WWI.
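
    (For the record, those conventional blocks look something like this. The harvester name "EmailSiphon" is only an example of a well-known offender, and "bad_bot" is an arbitrary marker variable:)

    # robots.txt -- only polite robots ever honor this:
    User-agent: EmailSiphon
    Disallow: /

    # .htaccess -- a name-based block (mod_setenvif plus mod_access):
    SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot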

    The ideal thing to do is to place controls based on actual bad behavior as it happens. I recently ran across a delightful and helpful thread on another forum that presented a clever solution using PHP. The gist of the thing is that it tracks visitor behavior, and visitors that are trying to download too many pages too fast are soon stopped with a 503 and a penalty time before they can load more pages; the time parameters are adjustable, and it can keep track of a sufficiency of simultaneous visitors with very little computational load.

    Here is a variant of the script as I have now installed it (with explanatory comments built in)--about all one needs to customize is the directory where the logfile is kept.

    <?php
    
    //   ENGLISH-LANGUAGE VERSION: 
    
    /*
    
    Notes...
    
        * $itime is the minimum number of seconds between visits _on average_ over 
          $itime*$imaxvisit seconds.  So in the example, a visitor isn't blocked 
      if it visits the script multiple times in the first 5 seconds, as long
          as it doesn't visit more than 60 times within 300 seconds (5 minutes).
    
        * If the limit is reached, $ipenalty is the number of seconds a visitor
          has to wait before being allowed back. 
    
    An MD5 hash is made of each visitor's IP address, and the last 3 hex digits of that hash are used to generate one of a possible 4096 filenames.  If it is a new visitor, or a visitor who hasn't been seen for a while, the timestamp of the file is set to the then-current time; otherwise, it must be a recent visitor, and the time stamp is increased by $itime. 
    
    If the visitor starts loading the timer script more rapidly than $itime seconds per visit, the time stamp on the IP-hashed filename will be increasing faster than the actual time is increasing.  If the time stamp gets too far ahead of the current time, the visitor is branded a bad visitor and the penalty is applied by increasing the time stamp on its file even further.
    
    4096 separate hash files is enough that it's very unlikely you'll get two visitors at exactly the same time with the same hash, but not so many that you need to keep tidying up the files.
    
    (Even if you do get more than one visitor with the same hash file at the same time, it's no great disaster: they'll just approach the throttle limit a little faster, which in most cases won't matter, as the limits in the example--5/60/60--are quite generous.)
    
    This script can be simply included in each appropriate php script with this:
    
    
      //   Spam-Block:
      include('timer.inc');
    
    */
    
      // INITIALIZATIONS:
    
      //   Constants:
    
      //     Fixed:
      $crlf=chr(13).chr(10);
      $br='<br />'.$crlf;
      $itime=5;  // minimum number of seconds between visits by one visitor
      $imaxvisit=60;  // maximum visits in $itime x $imaxvisit seconds
      $ipenalty=60;  // seconds before an over-limit visitor is allowed back
      $iplogdir="../logs/";
      $iplogfile="ErrantIPs.Log";
    
      //     Language-dependent:
      $spammer1='The Server is momentarily under heavy load.';
      $spammer2='Please wait ';
      $spammer3=' seconds and try again.';
    
    
    
      // OPERATION:
    
      //   Make Check:
    
      //     Get file time:
      $ipfile=substr(md5($_SERVER["REMOTE_ADDR"]),-3);  // -3 means 4096 possible files
      $oldtime=0;
      if (file_exists($iplogdir.$ipfile)) $oldtime=filemtime($iplogdir.$ipfile);
    
      //     Update times:
      $time=time();
      if ($oldtime<$time) $oldtime=$time;
      $newtime=$oldtime+$itime;
    
      //     Stop overuser:
      if ($newtime>=$time+$itime*$imaxvisit)
      {
        //     block visitor:
        touch($iplogdir.$ipfile,$time+$itime*($imaxvisit-1)+$ipenalty);
        header("HTTP/1.0 503 Service Temporarily Unavailable");
        header("Connection: close");
        header("Content-Type: text/html");
        echo '<html><head><title>Overload Warning</title></head><body><p align="center"><strong>'
              .$spammer1.'</strong>'.$br;
        echo $spammer2.$ipenalty.$spammer3.'</p></body></html>'.$crlf;
        //     log occurrence:
        $fp=@fopen($iplogdir.$iplogfile,"a");
        if ($fp!==FALSE)
        {
          $useragent='<unknown user agent>';
          if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent=$_SERVER["HTTP_USER_AGENT"];
          @fputs($fp,$_SERVER["REMOTE_ADDR"].' on '.date("D, d M Y, H:i:s").' as '.$useragent.$crlf);
          @fclose($fp);
        }
        exit();
      }
    
      //     Modify file time:
      touch($iplogdir.$ipfile,$newtime);
    
    ?>
    This script alone seriously slows down spambots, so they can't suck wild amounts of bandwidth. But it also generates a log, which allows you to periodically put IP blocks in .htaccess for heavy or frequent would-be abusers.
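
    For example, once an address shows up repeatedly in ErrantIPs.Log, .htaccess entries on this pattern (the addresses here are only placeholders) will shut it out for good:

    # repeat offenders taken from ErrantIPs.Log (placeholder addresses):
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 198.51.100.7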

    A second toy is an email-harvester trap. The script is simple:

    <?php
    
    
    //   makemail.php - create dynamic spurious email address: 
    
    
      //   "Constants":
    
      //     General:
      $blank=' ';
      $crlf=chr(13).chr(10);
      $br='<br />'.$crlf;
      $p='<br /><br />'.$crlf;
    
    
      //   Make Address:
    
      //     Get data:
      $visitor=trim($_SERVER['REMOTE_ADDR']);  // the harvester's IP address
      $visitor=str_replace('.','_',$visitor);
      $at=date("d_m_y_H_i_s");

      //     Echo address:
      $fakedup=$visitor.'__'.$at;
      echo 'And this is a spammer-trapping spurious'.$crlf;
      echo '<a href="mailto:'.$fakedup.'@mywonderfulsite.com">email</a>'
           .' address.)'.$crlf;
    
    ?>
    You call that script from any shtml file with a simple:

    <p align="center"><font color="#cccccc" size="1">
    (Do <em><strong>not</strong></em> click here: this is a
    <a href="http://mywonderfulsite.com/spweb1.php">false</a> link to catch evil web robots:
    anything or anyone visiting that link will be barred from this site.
    <br />
    <!--#include virtual="/makemail.php" -->
    </font></p>
    You phrase it and style it, of course, to your exact taste.

    The script generates an ad hoc email address that contains the IP address of the thief, plus the exact date and time of the theft. You need, of course, to configure your email software to direct mail addressed in that form to a particular mailbox. The timer script slows down their harvesting--possibly stopping it; I don't know how smart harvester software is about not wasting its time--but this trap also captures the IP of the thief. Thus, if you ever get a spam email to that address, you have the IP address and the time of the email-address theft. You can use that to explicitly block the thief (so long as it sticks to that address), and the email may even be sufficient evidence (in conjunction with the rest of the facts) for prosecution in those jurisdictions that allow suing spammers (I am in one, Washington State, and will see what happens if I get an IP I can track to a person or business entity).
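
    If you later want to read the IP and time back out of a trapped address, a few lines of PHP will do it. This is an untested sketch; the format is the one makemail.php writes above, and the address shown is only a made-up example:

    <?php

    //   Untested sketch: decode a trapped address back into its IP and time.
    //   The local part below is only a made-up example of makemail.php's output.

      $trapped='192_0_2_15__14_01_05_22_30_15';
      list($ippart,$atpart)=explode('__',$trapped,2);  // split the IP part from the d_m_y_H_i_s part
      $ip=str_replace('_','.',$ippart);                // back to 192.0.2.15
      list($d,$m,$y,$H,$i,$s)=explode('_',$atpart);
      echo 'Harvested by '.$ip.' on '.$d.'/'.$m.'/'.$y.' at '.$H.':'.$i.':'.$s;

    ?>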

    The HTML block shown above also includes a link to a third toy, SpiderWeb, a php script that looks like this (again, only the log directory needs customizing):

    <?php
    
    //   spweb1.php - Spider Trapper #1 
    
    
      //   "Constants":
    
      //     General:
      $blank=' ';
      $crlf=chr(13).chr(10);
      $br='<br />'.$crlf;
      $p='<br /><br />'.$crlf;
    
      //     Particular:
      $logdir='logs/';
      $logfile=$logdir.'Trap1.Log';
    
    
      //   Loop Them:
      header('Location: http://www.hostedscripts.com/scripts/antispam.html');
    
    
      //   Log Call:
    
      //     Get data:
      $referrer=trim($_SERVER['REMOTE_HOST']);
      if ($referrer==NULL) $referrer='unspecified referrer';
      $address=trim($_SERVER['REMOTE_ADDR']);
      if ($address==NULL) $address='unspecified address';
      $agent=trim($_SERVER['HTTP_USER_AGENT']);
      if ($agent==NULL) $agent='unspecified agent';
      $query=trim($_SERVER["QUERY_STRING"]);
      if ($query==NULL) $query='no query';
      $msg=$referrer.$crlf
          .'   '.$address.$crlf
          .'   '.$agent.$crlf
          .'   '.$query.$crlf
          .'     '.date('l, j F Y, H:i:s',time()-10800).$crlf
          .$crlf;
    
      //     Log data:
      $lhandle=@fopen($logfile,'a');
      if ($lhandle!==FALSE)
      {
        fwrite($lhandle,$msg.$crlf);
        fclose($lhandle);
      }
    
    
    ?>
    The essence of the trap is that spweb1.php (like makemail.php) is to be blocked in your robots.txt file. Any IP address it logs is a user-agent that ignored your robots.txt file.
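
    The robots.txt entries would look something like this (the paths assume the scripts sit in the site root, as in the HTML above; spweb2.php is the unlinked variant described below):

    User-agent: *
    Disallow: /spweb1.php
    Disallow: /spweb2.php
    Disallow: /makemail.php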

    (I have a virtually identical spweb2.php that is not linked anywhere: it is named only in the robots.txt file--so any user-agent caught by that trap actually harvests blocked files from robots.txt. There is no need to keep the two kinds of creeps segregated, but I like to be able to see which was which.)

    The 302-redirect link is to a neat harvester-poisoning site page, which you can go to and inspect for yourself. The thieves you send there will love it . . . .

    I put these forth as probably useful, but more as starting points so that others can gin up their own flavors. The essence, again, is to stop (or, at any rate, very seriously slow down) bots in the act, based purely on their actual behavior as seen in real time.
     
    Owlcroft, Jan 14, 2005 IP
    ResaleBroker likes this.
  2. Foxy

    #2
    As usual, well written - when I get a minute over the weekend I will digest it.

    As it happens, I recently had one site get its bandwidth sucked hugely, and I struggled to find ways to keep up.

    Thanks - this may give me the solution.
     
    Foxy, Jan 14, 2005 IP
  3. Owlcroft

    #3
    In fact, once one has used this stuff for a while, and so developed confidence in it, one could stop manually transferring "bad bot" IPs to .htaccess and have the PHP script rewrite .htaccess on the fly. I am still in the watch-and-check stage, though it has already nicely caught four bad bots in perhaps one full day, but dynamic rewriting is what I plan to eventually do.
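
    For anyone who wants to experiment with that now, the guts of it would be something like this untested sketch, to be folded into the "block visitor" branch of timer.inc. The .htaccess path is only an example, and it assumes the file already contains an "Order Allow,Deny" block with "Allow from all":

    <?php

    //   Untested sketch: append a deny line for the IP just caught.
    //   Assumes .htaccess already holds "Order Allow,Deny" and "Allow from all".

      $htaccess=$_SERVER["DOCUMENT_ROOT"].'/.htaccess';  // example path--adjust to taste
      $badip=$_SERVER["REMOTE_ADDR"];

      //   Write it only if it at least looks like a dotted-quad address:
      if (preg_match('/^\d{1,3}(\.\d{1,3}){3}$/',$badip))
      {
        $fp=@fopen($htaccess,'a');
        if ($fp!==FALSE)
        {
          @fputs($fp,'deny from '.$badip.chr(13).chr(10));
          @fclose($fp);
        }
      }

    ?>

    One would also want to check that the address is not already listed, and to keep a backup copy of .htaccess in case a write ever goes wrong.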

    The email trap has so far caught nine spurious emails, though unfortunately all were generated by an earlier trap that didn't have the IP-catch feature, so the spammers could claim they just bought the addresses or some such (and perhaps they did). As soon as I have a couple of IP-named spam emails, I will look into suing the spam harvesters. (As best I recall without looking, I believe Washington State's laws provide a $500 per spam fine.)
     
    Owlcroft, Jan 14, 2005 IP
  4. Owlcroft

    #4
    Now that I have used this toy myself for a while, with success, I think a couple of things will improve it slightly.

    First, on my own system I have raised the penalty time to, for now, 180 seconds; I find that I am trapping only real spammers, so I feel that is safe, and it will slow them down further (I may even make it 300 seconds soon).

    Second, to deal with situations where the site pages are in many directories and subdirectories, the definitions should be changed to--

    $trapdir=$_SERVER["DOCUMENT_ROOT"].'/traplogs/';
    $iplogfile=$_SERVER["DOCUMENT_ROOT"].'/logs/ErrantIPs.Log';
    --where you can plug in your own paths and names. The example shown above uses a single subdirectory right off the root to hold the up to 4,096 timer files, and a parallel directory to hold the logfile. The logfile can be put wherever is convenient for you, including some existing log directory, but it is tedious to have it in the same directory as the timer files, which should have a directory all their own.

    The PHP variable $_SERVER["DOCUMENT_ROOT"] just specifies the root directory of your site as an absolute path on your server. Thus--

    $_SERVER["DOCUMENT_ROOT"].'/traplogs/'

    --would parse out to something like--
    /usr/home/yourname/public_html/traplogs/
    --which is just that directory's actual path on your host server. (The HTTP URI equivalent would be http://www.mywonderfulsite.com/traplogs/).

    The entire PHP script, which I suggest be named timer.inc or spamtrap.inc or some such, can then be placed in one location--I use the root directory--and thus called by an absolute path. In PHP scripts, I use--

    include($_SERVER["DOCUMENT_ROOT"].'/timer.inc');
    --and in shtml files I use--

    <!--#include virtual="/timer.inc" -->
    I hope this helps better implement this script, which I have found very pleasing to use. (And, again, it is not of my original devising.)
     
    Owlcroft, Jan 17, 2005 IP
  5. ResaleBroker

    #5
    I noticed in your most recent post you had changed:

    $iplogdir="../logs/";

    to

    $trapdir=$_SERVER["DOCUMENT_ROOT"].'/traplogs/';

    Is that supposed to be "$iplogdir"?


    I also noticed:

    $iplogdir="../logs/";
    $iplogfile="ErrantIPs.Log";

    // Modify file time:
    touch($iplogdir.$ipfile,$newtime);

    Is that supposed to be "$iplogfile"?
     
    ResaleBroker, Jan 17, 2005 IP
  6. Owlcroft

    #6
    Yeah--I had changed the names in my own files, and when I posted, forgot to make them match the prior post.

    Sorry for the confusion.
     
    Owlcroft, Jan 18, 2005 IP
  7. Owlcroft

    #7
    Having used this script for a while now, I see an immense reduction in stolen bandwidth, so it is helping.

    But I also find that the scum are persistent, so I have set my own parameters, and recommend these to others, as:

    $itime=5;
    $imaxvisit=12;
    $ipenalty=600;
    That means that if any visitor, human or robot, takes more than 12 files in 60 seconds (5*12), they will be blocked from taking more files (via the script's "503 Service Temporarily Unavailable" response) for 10 minutes (600 seconds).

    (Recall that $itime is the desired average time between hits, and it is calculated over $itime*$imaxvisit seconds.)

    If the typical scum/harvester take rate of a hit a second is too much for your server over even a minute, you could even tighten it up to--

    $itime=5;
    $imaxvisit=6;
    $ipenalty=600;
    --which would impose the penalty for anything more than 6 hits in any 30 seconds.

    I urgently recommend against raising $itime above 5 seconds--most or all of the well-behaved bots (notably Google's) promise to hit no more often than that, and for the dumbos (M$ and Yahoo), you can--and should--put

    Crawl-delay: 5
    in your robots.txt file, after which no bot has any cause to hit more often than once every 5 seconds.

    Incidentally, I noticed the other day a cute bastard: it tried six different user-agent identities in as many seconds--obviously trying to get around by-name user-agent blocks. I have always thought those were silly, and that proves it. But even blocking by IP address--which I do when I detect one of these scumballs--is only a partial answer. We need to stop them dead in their tracks as they commit their acts, which is why I love this script (which, again, is not of my making).

    If anyone cares, the log of now-blocked offenders so far is this:

    # EmeraldShield.com WebBot ..... 24.227.118.54
    # Microsoft URL Control - 6.01.97..... 82.50.206.162
    # Microsoft URL Control - 6.00.8862..... 213.140.17.98
    # <unknown user agent>..... 217.224.211.130
    # Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) ..... 68.50.44.149
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)..... 24.248.170.119
    # Mozilla/3.Mozilla/2.01 (Win95; I)..... 68.158.5.125
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)..... 81.50.156.226
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)..... 80.200.107.134
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ..... 209.121.86.22
    # MSIE5.5..... 80.46.67.1
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)..... 64.42.105.59
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)..... 81.225.107.149
    # CydralSpider/1.8 (Cydral Web Image Search; http://www.cydral.com)..... 213.246.63.116
    # Web Downloader/6.5..... 200.204.63.24
    # RPT-HTTPClient/0.3-3..... 194.255.110.3
    # Mozilla/2.0 (compatible; MSIE 3.02; Win32) ..... 208.53.138.124
    # Mozilla/3.0 (Win95; I; 16bit)..... 208.53.138.124
    # Mozilla/3.01 (Macintosh; I; 68K)..... 208.53.138.124
    # Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0) ..... 208.53.138.124
    # Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.1) Opera 5.12 [en]..... 208.53.138.124
    # Mozilla/4.7 (Macintosh; I; PPC)..... 208.53.138.124
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)..... 213.42.2.25
    # Mozilla/5.0 (Macintosh; U; PPC;) Gecko DEVONtech..... 66.143.176.16
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)..... 195.229.241.188
    # <unknown user agent> ..... 198.87.84.185
    # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)..... 64.242.1.3
    # User-Agent: Mozilla/4.0 (http://www.fast-search-engine.com/)..... 65.98.67.74
     
    Owlcroft, Jan 28, 2005 IP
  8. ResaleBroker

    #8
    Like this?
    
    
    User-agent: Slurp 
    Crawl-delay: 5
     
    User-Agent: msnbot
    crawl-delay: 5
    
     
    ResaleBroker, Jan 28, 2005 IP
  9. Owlcroft

    #9
    I just use--

    User-agent: *
    Crawl-delay: 5

    Supposedly at least all of the Big 3 SEs "know" that directive.
     
    Owlcroft, Jan 28, 2005 IP
    ResaleBroker likes this.
  10. ResaleBroker

    #10
    Keep'n it Simple. :)
     
    ResaleBroker, Jan 28, 2005 IP
  11. TheAdMan

    #11
    In Google Sitemaps I noticed that Crawl-delay: 5 is not recognized by Googlebot. Any other suggestions for delaying crawlers?

    I found this out after submitting my site to Google Sitemaps.
     
    TheAdMan, Aug 18, 2006 IP
  12. reese

    #12
    i just came across this script and omgz thanks alot!! it works like a charm
     
    reese, Jan 28, 2008 IP
  13. Dolbz

    #13
    Rather than having to worry about whether your crawl delay is really working, couldn't you modify the script to whitelist the major SEs? Whitelisting their user agent wouldn't be a good idea, as anyone could pretend to be Googlebot, but I'm guessing all Googlebots will resolve back to google.com, so doing a reverse lookup could achieve this.
    Obviously doing all these lookups on every hit would be a bad idea, but you only need to do it on encountering a new 'Googlebot' IP and log the known friendlies.

    edit:
    after a quick check it seems Googlebot resolves to googlebot.com, so this should work nicely...
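
    Something like this would be the core of it (an untested sketch--gethostbyaddr() and gethostbyname() are standard PHP, the googlebot.com/google.com suffixes are from the check above, and the address in the example call is only a placeholder):

    <?php

    //   Untested sketch of the reverse-then-forward DNS check described above.

      function is_real_googlebot($ip)
      {
        //   Reverse lookup: genuine Googlebots resolve to *.googlebot.com (or *.google.com).
        $host=gethostbyaddr($ip);
        if ($host===FALSE || $host===$ip) return FALSE;  // malformed input or no reverse record
        if (!preg_match('/\.(googlebot|google)\.com$/i',$host)) return FALSE;

        //   Forward lookup: the name must resolve back to the same IP, otherwise
        //   anyone could publish a fake reverse record for their own address block.
        //   (gethostbynamel() would be stricter if the name has several addresses.)
        return (gethostbyname($host)===$ip);
      }

      //   Example call (the address is only a placeholder):
      if (is_real_googlebot('66.249.66.1')) echo 'genuine Googlebot';

    ?>

    A verified IP could then simply be exempted from the timer before the touch() logic runs, which would also do away with the crawl-delay worry for the big SEs.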
     
    Dolbz, Jan 29, 2008 IP
  14. Quench

    #14


    Hi, are you able to tell me which file this code belongs in, please? And whether it goes at the top or the bottom, etc.?
     
    Quench, Aug 7, 2008 IP