
Blocking Spam without Blocking Search Engines?

Discussion in 'Search Engine Optimization' started by long island insurance, Mar 8, 2009.

  1. #1
    I've recently been getting "spam quotes" submitted from an insurance website that I deal with. Basically the quote forms are supposed to be for people looking for insurance quotes, but lately I've been getting tons of quote requests sent to my email from the site with "Viagara" or "buy Viagara" as the person's name. They then of course put a link to some site I'm supposed to click on (which I don't).

    It's driving me nuts, and I'm guessing it must be a robot since the visitor doesn't show up AT ALL on Google Analytics! In order for me to get the quote, someone has to fill in the form first, and hit submit.

    Is there a way to stop this spam without also blocking the search engines from crawling my site?
     
    long island insurance, Mar 8, 2009 IP
  2. #2
    To stop bots you'll need to incorporate a captcha into your script.
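    For example, here is a minimal sketch of the idea in PHP, using a simple arithmetic question stored in the session (a text variant of a CAPTCHA) rather than an image; the quote.php filename and the field names are placeholders for whatever your real form uses:

    <?php
    // quote.php -- illustrative only; adapt the fields to your real quote form.
    session_start();

    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
      // Reject the submission unless the answer matches the challenge we issued.
      $expected = isset($_SESSION['captcha_answer']) ? $_SESSION['captcha_answer'] : null;
      $given    = isset($_POST['captcha']) ? trim($_POST['captcha']) : '';
      if ($expected === null || (string)$expected !== $given) {
        header("HTTP/1.0 403 Forbidden");
        exit('Verification failed -- please go back and try again.');
      }
      unset($_SESSION['captcha_answer']);  // one answer per challenge
      // ... process the legitimate quote request here (e.g. email it to yourself) ...
      exit('Thanks, your quote request has been received.');
    }

    // On a normal GET, issue a fresh challenge and render the form.
    $a = rand(1, 9);
    $b = rand(1, 9);
    $_SESSION['captcha_answer'] = $a + $b;
    ?>
    <form method="post" action="quote.php">
      <!-- your normal quote fields go here -->
      <label>What is <?php echo $a; ?> + <?php echo $b; ?>?
        <input type="text" name="captcha" />
      </label>
      <input type="submit" value="Get Quote" />
    </form>
    An image CAPTCHA or a third-party service works the same way: keep the expected answer on the server and refuse any POST that doesn't supply it.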
     
    internetmarketingiq, Mar 8, 2009 IP
  3. #3
    You should definitely incorporate a form that requires typing in a captcha.

    Lisa
     
    audiblemarketing, Mar 8, 2009 IP
  4. #4
    Spam emails originate from lists of email addresses. Those lists are typically compiled by email-harvester robots that ruthlessly examine every site they can find for any embedded email addresses. The best approach, at least in the long term, is to stop harvesters; that won't get you off current lists, but I suspect those don't remain in use for all that long.

    Stopping harvesters used to be problematic: tricks like IP blocking are almost useless because such scum change their domains more often than their underwear. Fortunately, there is a handy trick, which is based on the reality that spam harvesters, unlike civilized searchbots, will invariably hammer any site they're scanning (by which I mean they will take pages as fast as the server can deliver them).

    Somebody--and I wish I could now recall who, to give credit where it's due--some time ago devised a very clever little PHP script, includable in any .php (or, I believe, .shtml) web page, that will stop spammers. I use it on all my sites, and I include the code, which is annotated, below:

    <?php
    
      // CONTROL PARAMETERS:
    
      $itime=5;          // minimum *average* number of seconds between one-visitor visits
      $imaxvisit=12;     // maximum visits allowed in ($itime x $imaxvisit seconds, e.g. 5 x 12 = 60 seconds)
      $ipenalty=10;      // minutes before visitor is allowed back
      $logging='no';     // controls whether a log of attackers will be kept
    
    /*
    
    Notes...
    
        * $itime is the minimum number of seconds between visits _on average_ over 
          $itime*$imaxvisit seconds.  So, in the example, a visitor isn't blocked if 
          it visits the script multiple times in the first 5 seconds, as long as 
          it doesn't visit more than 60 times within 300 seconds (5 minutes).
    
        * If the limit is reached, $ipenalty is the number of minutes a visitor has to 
          wait before being allowed back. 
    
    An MD5 hash is made of each visitor's IP address, and the last 3 hex digits of that hash 
    are used to generate one of a possible 4,096 filenames.  If it is a new visitor, or a visitor
    who hasn't been seen for a while, the timestamp of the file is set to the then-current time; 
    otherwise, it must be a recent visitor, and the time stamp is increased by $itime. 
    
    If the visitor starts loading the timer script more rapidly than $itime seconds per visit,
    the time stamp on the IP-hashed filename will be increasing faster than the actual time is 
    increasing.  If the time stamp gets too far ahead of the current time, the visitor is branded 
    a bad visitor and the penalty is applied by increasing the time stamp on its file even 
    further.
    
    4,096 separate hash files is enough that it's very unlikely you'll get two visitors at exactly 
    the same time with the same hash, but not so many that you need to keep tidying up the files.
    
    (Even if you do get more than one visitor with the same hash file at the same time, it's 
    no great disaster: they'll just approach the throttle limit a little faster, which in most 
    cases won't matter, as the limits are quite generous.)
    
    Note: This will NOT neatly stop a bot from taking more than X files in Y seconds; its
    action depends on the difference between the allowed and the actual rate.  Bots taking
    only a bit more than allowed can run for quite some time--only really fast ones will be
    stopped quickly.  But that is acceptable behavior: the worse they offend, the faster
    they're stopped; the less they offend, the longer they have.
    
    This script assumes that there are two subdirectories off whatever directory it itself is
    housed in: /logs and /trapfiles
    
    The use of logs is self-evident; the use of trapfiles is as home to the transient files
    used as timers.  There is nothing magic about that arrangement, and you can change it in
    the script below if you take care with what you are doing.  But there must be *some* home
    for the generated logs and timer files.
    
    (The logging is not essential, and you can eliminate it if you like.)
    
    */
    
    
      // INITIALIZATIONS:
    
      //   Set Flush:
      ob_implicit_flush(TRUE);
    
      //   Constants:
    
      //     See if WIN:
      $windoze=FALSE;
      if (strtoupper(substr(PHP_OS,0,3))=='WIN') $windoze=TRUE;
    
      //     General:
      $blank=' ';
      $crlf=chr(13).chr(10);
      $br='<br />'.$crlf;
      $p='<br /><br />'.$crlf;
      $slash='/';
      $localslash='/';
      if ($windoze===TRUE) $localslash='\\';
      $oneq="'";
      $twoq='"';
      $dot='.';
    
      //     Logging:
      $logit=FALSE;
      if (substr(strtolower($logging),0,1)==='y') $logit=TRUE;
    
      //     Directories/Files:
    
      //       main directory:
      $timerpath=dirname(realpath(__FILE__)).$localslash;  //   e.g   /usr/www/users/ewalker/seo-toys/bookadder/
    
      //       subdirectories:
      $timerlogspath=$timerpath.'logs/';
      $iplogfile=$timerlogspath.'ErrantIPs.Log';
      $timertrapspath=$timerpath.'trapfiles/';
    
    
      // OPERATION:
    
      //   Bail A/R:
      if (is_dir($timertrapspath)===FALSE) return;  // can't do squat
    
      //   Free Passes:
      $freeps=@file($timerpath.'freepasses.php');  // optional whitelist, one IP per line
      if ($freeps===FALSE) $freeps=array();
      $freeps[]=$_SERVER["SERVER_ADDR"];  // self always
      foreach ($freeps as $freep)
      {if ($_SERVER["REMOTE_ADDR"]==trim($freep)) return;}  // whitelisted: skip the throttle check
    
      //   Make Check:
    
      //     Get file time:
      $ipfile=$timertrapspath.substr(md5($_SERVER["REMOTE_ADDR"]),-3);  // -3 means 4096 possible files
      $oldtime=0;
      if (file_exists($ipfile)===TRUE) $oldtime=filemtime($ipfile);
    
      //     Update times:
      $time=time();
      if ($oldtime<$time) $oldtime=$time;  // not allowed to fall behind
      $newtime=$oldtime+$itime;  // apparently incremented even on 1st hit
      $allowed=$itime*$imaxvisit;  // maximum amount filetime can lead current real time
    
      //     Stop overuser:
      if ($newtime>=$time+$allowed)
      {
        //     block visitor:
        touch($ipfile,$time+$itime*($imaxvisit-1)+(60*$ipenalty));
        header("HTTP/1.0 503 Service Temporarily Unavailable");
        header("Retry-After: '.60*$ippenalty.'");
        header("Connection: close");
        header("Content-Type: text/html");
        echo '<html>'.$crlf;
        echo ' <head>'.$crlf;
        echo '  <title>Overload Warning</title>'.$crlf;
        echo ' </head>'.$crlf;
        echo '<body>'.$crlf;
        echo '<p align="center"><strong>The Server is momentarily under heavy load.</strong>'.$br;
        echo 'Please wait at least '.$ipenalty.' minutes before trying again.</p>'.$crlf;
        echo '</body>'.$crlf;
        echo '</html>'.$crlf;
    
        //     log occurrence:
        if ($logit===TRUE)
        {
          $spamlog=@fopen($iplogfile,"a");
          if ($spamlog!==FALSE)
          {
            $useragent='<unknown user agent>';
            if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent=$_SERVER["HTTP_USER_AGENT"];
            @fputs($spamlog,str_pad('#   '.$useragent,135).str_pad(' Deny From '.$_SERVER["REMOTE_ADDR"],28).date("D, Y M d, H:i:s").$crlf);
          }
          @fclose($spamlog);
        }
    
        exit();  // done for this page
    
      }  // tripped limit
    
      //   Modify file time:
      touch($ipfile,$newtime);
    
    ?>
    Code (markup):
    Note that logging is optional but, if wanted, needs a subdirectory for the log files; a second subdirectory is needed for storing the hash/timer files. There is also allowance for a file named freepasses.php, which can contain a set of IP addresses for which the timer is not applied.
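    For illustration, since the script reads freepasses.php with file(), the simplest layout is one IP address per line; the addresses below are placeholders for your own:

    127.0.0.1
    203.0.113.7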

    The script needs to be included in every user-reachable page, or at least in any you don't want harvested.
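    For example, assuming you saved the script as harvester-trap.php (the name is mine, use whatever you like) alongside its /logs and /trapfiles subdirectories, the top of each page would look something like this--the include has to run before any HTML is output, because the script sends a 503 header when it blocks a visitor:

    <?php
      // Run the rate-limit check before anything is sent to the browser;
      // adjust the path/name to wherever you actually put the script.
      include dirname(__FILE__).'/harvester-trap.php';
    ?>
    <html>
    ... rest of the page as usual ...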

    Finally, note that if you do enable logging, the log grows pretty quickly, because there are a lot of harvesters at work out there. What I recommend is that you let it log for a week or two and see what the logs look like. You might be surprised to occasionally see legitimate searchbots in it: that's because even the best of them lie about how careful they are, and do indeed on occasion hammer sites.

    Most of the big ones, except Google itself, will honor a Crawl-delay: directive in robots.txt, and I recommend every robots.txt include one with a setting of 5 (seconds). For Google, if you are enrolled in Google Webmaster Tools (and if not, why in Heaven's name not?), you can check the Settings for your site and see (and sometimes, depending, adjust) what G claims it is doing about hit rates on your site.
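    For reference, a robots.txt carrying such a Crawl-delay directive can be as short as this (the 5 is the five-second delay suggested above):

    User-agent: *
    Crawl-delay: 5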

    (Note: I tried to remove from the script anything particular to my usage of it, but frankly was in a hurry; I think it's vanilla as is, but look it over. You should do that anyway with any sort of code before implementing it.)
     
    Owlcroft, Mar 8, 2009 IP
  5. #5
    Use an image verification code (CAPTCHA) on any form that submits to your database.

    There are plenty of free services online for this.
     
    SabQat, Mar 10, 2009 IP