I've recently been getting "spam quotes" submitted from an insurance website that I deal with. The quote forms are supposed to be for people looking for insurance quotes, but lately I've been getting tons of quote requests sent to my email from the site with "Viagara" or "buy Viagara" as the person's name. They then, of course, put in a link to some site I'm supposed to click on (which I don't). It's driving me nuts, and I'm guessing it must be a robot, since the visitor doesn't show up AT ALL in Google Analytics! For me to get the quote, someone has to fill in the form and hit submit. Is there a way to stop this spam without also blocking the search engines from crawling my site?
Spam emails originate from lists of email addresses. Those lists are typically compiled by email-harvester robots that ruthlessly examine every site they can find for any embedded email addresses. The best approach, at least in the long term, is to stop the harvesters; that won't get you off current lists, but I suspect those don't remain in use for all that long. Stopping harvesters used to be problematic: tricks like IP blocking are almost useless, because such scum change their domains more often than their underwear. Fortunately, there is a handy trick, based on the reality that spam harvesters, unlike civilized searchbots, will invariably hammer any site they're scanning (by which I mean they will take pages as fast as the server can deliver them). Somebody--and I wish I could now recall who, to give credit where it's due--some time ago devised a very clever little PHP script that can be included in any .php (or, I believe, .shtml) web page and will stop such bots. I use it on all my sites, and I include the code, annotated, below. (I have cleaned up a few small bugs in the posted version: a misspelled variable and broken quoting in the Retry-After header, an `exit` where a `return` was needed for free-pass IPs, and a missing `trim()` on lines read from the free-passes file.)

```php
<?php
// CONTROL PARAMETERS:
$itime=5;       // minimum *average* number of seconds between one-visitor visits
$imaxvisit=12;  // maximum visits allowed in $itime x $imaxvisit seconds (e.g. 5 x 12 = 60 seconds)
$ipenalty=10;   // minutes before a blocked visitor is allowed back
$logging='no';  // controls whether a log of attackers will be kept

/* Notes...
 * $itime is the minimum number of seconds between visits _on average_ over
   $itime*$imaxvisit seconds. So, in the example, a visitor isn't blocked for
   hitting the script several times in the first 5 seconds, as long as it
   doesn't visit more than 60 times within 300 seconds (5 minutes).
 * If the limit is reached, $ipenalty is the number of minutes the visitor has
   to wait before being allowed back.
 * An MD5 hash is made of each visitor's IP address, and the last 3 hex digits
   of that hash are used to generate one of a possible 4,096 filenames. For a
   new visitor, or one not seen for a while, the timestamp of that file is set
   to the then-current time; otherwise the timestamp is increased by $itime.
   If the visitor loads pages more rapidly than one per $itime seconds, the
   timestamp on its hash file advances faster than real time does. If the
   stamp gets too far ahead of the current time, the visitor is branded a bad
   visitor and the penalty is applied by pushing the stamp even further ahead.
 * 4,096 hash files is enough that two simultaneous visitors with the same
   hash are very unlikely, but not so many that you need to keep tidying up
   the files. (Even a collision is no great disaster: the two visitors just
   approach the throttle limit a little faster, which in most cases won't
   matter, as the limits are quite generous.)
 * This will NOT neatly stop a bot from taking more than X files in Y seconds;
   its action depends on the difference between the allowed and the actual
   rate. A bot only slightly over the limit can run for quite some time--only
   really fast ones are stopped quickly. But that is acceptable behavior: the
   worse they offend, the faster they're stopped; the less they offend, the
   longer they have.
 * The script assumes two subdirectories off whatever directory it is housed
   in: /logs and /trapfiles. The use of logs is self-evident; trapfiles is
   home to the transient files used as timers. There is nothing magic about
   that arrangement, and you can change it below if you take care with what
   you are doing. But there must be *some* home for the generated logs and
   timer files. (The logging is not essential, and you can eliminate it if
   you like.)
*/

// INITIALIZATIONS:
// Set flush:
ob_implicit_flush(TRUE);
// See if Windows:
$windoze=FALSE;
if (strtoupper(substr(PHP_OS,0,3))=='WIN') $windoze=TRUE;
// General:
$crlf=chr(13).chr(10);
$br='<br />'.$crlf;
$localslash='/';
if ($windoze===TRUE) $localslash='\\';
// Logging:
$logit=FALSE;
if (substr(strtolower($logging),0,1)==='y') $logit=TRUE;
// Directories/files:
$timerpath=dirname(realpath(__FILE__)).$localslash;
$timerlogspath=$timerpath.'logs/';
$iplogfile=$timerlogspath.'ErrantIPs.Log';
$timertrapspath=$timerpath.'trapfiles/';

// OPERATION:
// Bail as required:
if (is_dir($timertrapspath)===FALSE) return; // can't do squat

// Free passes (IPs exempt from the throttle):
$freeps=file($timerpath.'freepasses.php');
if ($freeps===FALSE) $freeps=array();
$freeps[]=$_SERVER["SERVER_ADDR"]; // self always gets a pass
foreach ($freeps as $freep) {
  if ($_SERVER["REMOTE_ADDR"]==trim($freep)) return; // exempt: skip the check
}

// Make check:
// Get file time:
$ipfile=$timertrapspath.substr(md5($_SERVER["REMOTE_ADDR"]),-3); // -3 means 4,096 possible files
$oldtime=0;
if (file_exists($ipfile)===TRUE) $oldtime=filemtime($ipfile);
// Update times:
$time=time();
if ($oldtime<$time) $oldtime=$time;  // not allowed to fall behind real time
$newtime=$oldtime+$itime;            // incremented even on 1st hit
$allowed=$itime*$imaxvisit;          // maximum lead over current real time
// Stop overuser:
if ($newtime>=$time+$allowed) {
  // Block visitor:
  touch($ipfile,$time+$itime*($imaxvisit-1)+(60*$ipenalty));
  header("HTTP/1.0 503 Service Temporarily Unavailable");
  header("Retry-After: ".(60*$ipenalty));
  header("Connection: close");
  header("Content-Type: text/html");
  echo '<html>'.$crlf;
  echo ' <head>'.$crlf;
  echo '  <title>Overload Warning</title>'.$crlf;
  echo ' </head>'.$crlf;
  echo ' <body>'.$crlf;
  echo '  <p align="center"><strong>The server is momentarily under heavy load.</strong>'.$br;
  echo '  Please wait at least '.$ipenalty.' minutes before trying again.</p>'.$crlf;
  echo ' </body>'.$crlf;
  echo '</html>'.$crlf;
  // Log occurrence:
  if ($logit===TRUE) {
    $spamlog=@fopen($iplogfile,"a");
    if ($spamlog!==FALSE) {
      $useragent='<unknown user agent>';
      if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent=$_SERVER["HTTP_USER_AGENT"];
      @fputs($spamlog,str_pad('# '.$useragent,135).str_pad(' Deny From '.$_SERVER["REMOTE_ADDR"],28).date("D, Y M d, H:i:s").$crlf);
      @fclose($spamlog);
    }
  }
  exit(); // done for this page
} // tripped limit

// Modify file time:
touch($ipfile,$newtime);
?>
```

Note that logging is optional but, if wanted, needs a subdirectory for the log files; a second subdirectory is needed for storing the timer files. There is also allowance for a file named freepasses.php, which can contain a list of IP addresses to which the timer is not applied. The script needs to be included in every user-reachable file, or at least every one you don't want harvested.

Finally, note that if you do enable logging, the log blows up pretty quickly, because there are a lot of harvesters at work out there. What I recommend is that you let it log for a week or two and see what the logs look like. You might be surprised to occasionally see legitimate searchbots in it: even the best of them lie about how careful they are, and do indeed on occasion hammer sites. Most of the big ones, except Google itself, will honor a Crawl-delay: directive in robots.txt, and I recommend every robots.txt include one with a setting of 5 (seconds). For Google, if you are enrolled in Google Webmaster Tools (and if not, why in Heaven's name not?), you can check the Settings for your site and see (and sometimes, depending, adjust) what G claims it is doing about hit rates on your site.

(Note: I tried to remove from the script anything particular to my usage of it, but frankly I was in a hurry; I think it's vanilla as is, but look it over. You should do that anyway with any sort of code before implementing it.)
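For reference, the Crawl-delay directive mentioned above goes in robots.txt like so. (Bear in mind it is a de-facto convention honored by some crawlers rather than part of the original robots.txt standard, and Googlebot ignores it.)

```
User-agent: *
Crawl-delay: 5
```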
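To make the timestamp-advance throttle easier to follow apart from the PHP plumbing, here is a minimal sketch of the same logic in Python. All names here are my own; an in-memory dict stands in for the /trapfiles timestamp files, and `now` is injectable so the behavior can be checked deterministically:

```python
# Sketch of the throttle algorithm described above: each visit advances a
# per-IP "virtual clock" by ITIME seconds; if that clock runs too far ahead
# of real time, the visitor is penalized by pushing the clock further ahead.
import hashlib
import time

ITIME = 5        # minimum *average* seconds between visits
IMAXVISIT = 12   # visits allowed per ITIME * IMAXVISIT window
IPENALTY = 10    # penalty in minutes once the limit is tripped

stamps = {}      # bucket -> virtual timestamp (stands in for file mtimes)

def check_visitor(ip, now=None):
    """Return True if the visit is allowed, False if throttled."""
    if now is None:
        now = time.time()
    # Last 3 hex digits of the MD5 of the IP -> one of 4,096 buckets,
    # exactly as the PHP script derives its trapfile names.
    bucket = hashlib.md5(ip.encode()).hexdigest()[-3:]
    old = max(stamps.get(bucket, 0), now)  # never fall behind real time
    new = old + ITIME                      # advance by ITIME per visit
    if new >= now + ITIME * IMAXVISIT:     # too far ahead of real time?
        # Apply the penalty by pushing the stamp into the future.
        stamps[bucket] = now + ITIME * (IMAXVISIT - 1) + 60 * IPENALTY
        return False
    stamps[bucket] = new
    return True
```

With the defaults, a burst of hits at the same instant is allowed 11 times and blocked on the 12th, after which the visitor stays blocked until the 10-minute penalty has elapsed; a polite crawler taking one page every 5+ seconds never accumulates a lead and is never blocked.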
Use an image-verification (CAPTCHA) code when submitting any form to the database. There are many free services online that provide this.
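For illustration only, here is a minimal sketch of the server-side half of such a check, in Python. In practice you would use one of the free CAPTCHA services as suggested; the function names and the simple arithmetic question here are my own stand-ins, just to show the flow (store the expected answer server-side, then compare it with what the form submits):

```python
# Minimal sketch of a server-side challenge check for a quote form.
# A real deployment would use an image CAPTCHA service; a simple
# arithmetic question stands in so the verification flow is visible.
import random

def make_challenge(rng=random):
    """Generate a question to embed in the form, plus the expected
    answer to stash in the visitor's server-side session."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return "What is {} + {}?".format(a, b), str(a + b)

def form_is_human(submitted_answer, expected_answer):
    """Accept the submission only if the challenge was answered."""
    return submitted_answer.strip() == expected_answer
```

Bots that blindly POST the form leave the challenge field blank or wrong, so their submissions never reach your inbox, while a human filling in the form passes without trouble.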