Owlcroft
Jan 14th 2005, 3:41 am
We all know the pain--and bandwidth cost, both figurative and literal--of spambots that suck up our sites, whether for scraping, email harvesting, or whatever.
We all also know the various ways in which unwanted user-agents can, in principle, be stopped: robots.txt blocks by user-agent name, and .htaccess blocks by user-agent name or IP address. Those, while helpful, can scarcely do the whole job, inasmuch as they necessarily block particular user-agent names and particular IP addresses, and spammers can and do change those things a lot more often than they probably change their underwear. Using those tools is fighting WWII with the weapons of WWI.
The ideal thing to do is to place controls based on actual bad behavior as it happens. I recently ran across a delightful and helpful thread on another forum (http://www..webmasterworld.com/forum88/119.htm) that presented a clever solution using PHP. The gist of the thing is that it tracks visitor behavior, and visitors that are trying to download too many pages too fast are soon stopped with a 503 and a penalty time before they can load more pages; the time parameters are adjustable, and it can keep track of a sufficiency of simultaneous visitors with very little computational load.
Here is a variant of the script as I have now installed it (with explanatory comments built in)--about all one needs to customize is the directory where the logfile is kept.
<?php
// ENGLISH-LANGUAGE VERSION:
/*
Notes...
* $itime is the minimum number of seconds between visits _on average_ over
$itime*$imaxvisit seconds. So in the example, a visitor isn't blocked
if it visits the script multiple times in the first 5 seconds, as long
as it doesn't visit more than 60 times within 300 seconds (5 minutes).
* If the limit is reached, $ipenalty is the number of seconds a visitor
has to wait before being allowed back.
An MD5 hash is made of each visitor's IP address, and the last 3 hex digits of that hash are used to generate one of a possible 4096 filenames. If it is a new visitor, or a visitor who hasn't been seen for a while, the timestamp of the file is set to the then-current time; otherwise, it must be a recent visitor, and the time stamp is increased by $itime.
If the visitor starts loading the timer script more rapidly than $itime seconds per visit, the time stamp on the IP-hashed filename will be increasing faster than the actual time is increasing. If the time stamp gets too far ahead of the current time, the visitor is branded a bad visitor and the penalty is applied by increasing the time stamp on its file even further.
4096 separate hash files is enough that it's very unlikely you'll get two visitors at exactly the same time with the same hash, but not so many that you need to keep tidying up the files.
(Even if you do get more than one visitor with the same hash file at the same time, it's no great disaster: they'll just approach the throttle limit a little faster, which in most cases won't matter, as the limits in the example--5/60/60--are quite generous.)
This script can be simply included in each appropriate php script with this:
// Spam-Block:
include('timer.inc');
*/
// INITIALIZATIONS:
// Constants:
// Fixed:
$crlf=chr(13).chr(10);
$itime=5; // minimum number of seconds between one-visitor visits
$imaxvisit=60; // maximum visits in $itime x $imaxvisit seconds
$ipenalty=60; // seconds before visitor is allowed back
$iplogdir="../logs/";
$iplogfile="ErrantIPs.Log";
// Language-dependent:
$spammer1='The Server is momentarily under heavy load.';
$spammer2='Please wait ';
$spammer3=' seconds and try again.';
// OPERATION:
// Make Check:
// Get file time:
$ipfile=substr(md5($_SERVER["REMOTE_ADDR"]),-3); // -3 means 4096 possible files
$oldtime=0;
if (file_exists($iplogdir.$ipfile)) $oldtime=filemtime($iplogdir.$ipfile);
// Update times:
$time=time();
if ($oldtime<$time) $oldtime=$time;
$newtime=$oldtime+$itime;
// Stop overuser:
if ($newtime>=$time+$itime*$imaxvisit)
{
// block visitor:
touch($iplogdir.$ipfile,$time+$itime*($imaxvisit-1)+$ipenalty);
header("HTTP/1.0 503 Service Temporarily Unavailable");
header("Connection: close");
header("Content-Type: text/html");
echo '<html><head><title>Overload Warning</title></head><body><p align="center"><strong>'
.$spammer1.'</strong><br />'.$crlf; // $br is not defined in this script, so the break is written out
echo $spammer2.$ipenalty.$spammer3.'</p></body></html>'.$crlf;
// log occurrence:
$fp=@fopen($iplogdir.$iplogfile,"a");
if ($fp!==FALSE)
{
$useragent='<unknown user agent>';
if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent=$_SERVER["HTTP_USER_AGENT"];
@fputs($fp,$_SERVER["REMOTE_ADDR"].' on '.date("D, d M Y, H:i:s").' as '.$useragent.$crlf);
@fclose($fp); // close only a handle that was actually opened
}
exit();
}
// Modify file time:
touch($iplogdir.$ipfile,$newtime);
?>
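For anyone who wants to see the arithmetic in action before installing anything, here is a small sketch of the same bookkeeping with the file's timestamp passed in as a plain number instead of being read with filemtime(). The function names throttle_step and burst_allowed are mine, purely for illustration; they are not part of the timer script.

```php
<?php
// Hypothetical helper: one throttle step, with the file's mtime passed in as $stored.
// Returns array(new stored time, blocked?). Mirrors the arithmetic of the timer script.
function throttle_step($stored, $now, $itime = 5, $imaxvisit = 60, $ipenalty = 60)
{
    if ($stored < $now) {
        $stored = $now;               // new or long-idle visitor: start from "now"
    }
    $newtime = $stored + $itime;      // each visit pushes the stamp $itime ahead
    if ($newtime >= $now + $itime * $imaxvisit) {
        // stamp is too far ahead of the clock: apply the penalty
        return array($now + $itime * ($imaxvisit - 1) + $ipenalty, true);
    }
    return array($newtime, false);
}

// Count how many requests arriving in the very same second get through.
function burst_allowed($itime = 5, $imaxvisit = 60, $ipenalty = 60)
{
    $now = 1000000;                   // arbitrary fixed clock value for the simulation
    $stored = 0;                      // 0 stands for "no file yet"
    $allowed = 0;
    while (true) {
        list($stored, $blocked) = throttle_step($stored, $now, $itime, $imaxvisit, $ipenalty);
        if ($blocked) {
            break;
        }
        $allowed++;
    }
    return $allowed;
}

echo burst_allowed(), "\n";           // prints 59 with the 5/60/60 defaults
```

With the default settings the 60th back-to-back request is the one that trips the limit, so 59 get through. Note also that every blocked request resets the penalty stamp, so a bot that keeps hammering stays locked out for as long as it keeps trying.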
The timer script alone seriously slows down spambots, so they can't suck wild amounts of bandwidth. It also generates a log, which allows you to periodically put IP blocks in .htaccess for heavy or frequent would-be abusers.
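As a rough idea of how that log could be turned into candidate .htaccess lines, here is a minimal sketch. It assumes the log lines look the way the timer script writes them ("<ip> on <date> as <user agent>"); the function name repeat_offenders and the threshold of 3 hits are my own inventions, and the output is meant to be eyeballed before anything is pasted into .htaccess.

```php
<?php
// Hypothetical helper: count throttle hits per IP from ErrantIPs.Log-style lines
// and return the IPs seen at least $minhits times, busiest first.
function repeat_offenders(array $lines, $minhits = 3)
{
    $counts = array();
    foreach ($lines as $line) {
        $pos = strpos($line, ' on ');
        if ($pos === false || $pos === 0) {
            continue;                       // not a log line we recognise
        }
        $ip = substr($line, 0, $pos);
        $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
    }
    arsort($counts);                        // sort by hit count, descending
    $result = array();
    foreach ($counts as $ip => $n) {
        if ($n >= $minhits) {
            $result[$ip] = $n;
        }
    }
    return $result;
}

// Usage sketch: print candidate lines for review.
$logpath = '../logs/ErrantIPs.Log';         // same location as in the timer script
if (is_readable($logpath)) {
    foreach (repeat_offenders(file($logpath)) as $ip => $n) {
        // Apache comments must sit on their own line, so the count goes above the rule.
        echo '# ' . $n . " throttle hits\n";
        echo 'Deny from ' . $ip . "\n";
    }
}
```

Bear in mind that one log entry is written per blocked request, so a single impatient human reloading a page can rack up a few hits; the threshold is there to keep such visitors off the list.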
A second toy is an email-harvester trap. The script is simple:
<?php
// makemail.php - create dynamic spurious email address:
// "Constants":
// General:
$blank=' ';
$crlf=chr(13).chr(10);
$br='<br />'.$crlf;
$p='<br /><br />'.$crlf;
// Make Address:
// Get data:
$referrer=trim($_SERVER['REMOTE_ADDR']); // REMOTE_ADDR is always set; REMOTE_HOST is empty unless the server does hostname lookups
$referrer=str_replace('.','_',$referrer);
$at=date("d_m_y_H_i_s");
// Echo address:
$fakedup=$referrer.'__'.$at;
echo 'And this is a spammer-trapping spurious'.$crlf;
echo '<a href="mailto:'.$fakedup.'@mywonderfulsite.com">email</a>'
.' address.)'.$crlf;
?>
You call that script from any shtml file with a simple:
<p align="center"><font color="#cccccc" size="1">
(Do <em><strong>not</strong></em> click here: this is a
<a href="http://mywonderfulsite.com/spweb1.php">false</a> link to catch evil web robots:
anything or anyone visiting that link will be barred from this site.
<br />
<!--#include virtual="/makemail.php" -->
</font></p>
You phrase it and style it, of course, to your exact taste.
The script generates an ad hoc email address that contains the IP address of the thief, plus the exact date and time of the theft. You need, of course, to configure your email software to direct emails addressed in that form to a particular mailbox. The timer script slows down their harvesting--possibly stopping it, I don't know how smart harvester software is about not wasting its time--but this trap also gets the IP of the thief. Thus, if you ever get a spam email to that address, you have the IP address and the time of the email-address theft, which you can use to explicitly block that thief (so long as it sticks to that address). Moreover, the email may be sufficient evidence (in conjunction with the rest of the facts) for legal action in those jurisdictions that allow suing spammers. I am in one, Washington State, and will see what happens if I get an IP I can track to a person or business entity.
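When a spam does arrive at one of those addresses, the local part has to be read back into an IP and a time. Here is a small sketch of that step. It assumes the visitor identifier in the address is a dotted IPv4 address with the dots turned into underscores (a hostname would not match), and that the two-digit year written by date("y") means 20yy; the function name decode_trap_address is hypothetical.

```php
<?php
// Hypothetical helper: recover the IP and timestamp from a trap address of the
// form a_b_c_d__dd_mm_yy_HH_ii_ss@domain, as makemail.php builds it for an IPv4 visitor.
function decode_trap_address($address)
{
    $pattern = '/^(\d{1,3})_(\d{1,3})_(\d{1,3})_(\d{1,3})__'
             . '(\d{2})_(\d{2})_(\d{2})_(\d{2})_(\d{2})_(\d{2})@/';
    if (!preg_match($pattern, $address, $m)) {
        return null;                        // hostname-based or unrelated address
    }
    return array(
        'ip'   => $m[1] . '.' . $m[2] . '.' . $m[3] . '.' . $m[4],
        // day, month, year come in that order; the century is an assumption
        'when' => sprintf('20%s-%s-%s %s:%s:%s', $m[7], $m[6], $m[5], $m[8], $m[9], $m[10]),
    );
}

print_r(decode_trap_address('192_0_2_7__14_01_05_03_41_07@mywonderfulsite.com'));
```

For the sample address this yields the IP 192.0.2.7 and the time 2005-01-14 03:41:07, in whatever timezone the server's date() was using when the page was harvested.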
The HTML block shown above also includes a link to a third toy, SpiderWeb, a php script that looks like this (again, only the log directory needs customizing):
<?php
// spweb1.php - Spider Trapper #1
// "Constants":
// General:
$blank=' ';
$crlf=chr(13).chr(10);
$br='<br />'.$crlf;
$p='<br /><br />'.$crlf;
// Particular:
$logdir='logs/';
$logfile=$logdir.'Trap1.Log';
// Loop Them:
header('Location: http://www.hostedscripts.com/scripts/antispam.html');
// Log Call:
// Get data:
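// Each value may be absent from $_SERVER, so check before reading it:
$referrer=isset($_SERVER['REMOTE_HOST'])?trim($_SERVER['REMOTE_HOST']):'';
if ($referrer=='') $referrer='unspecified referrer';
$address=isset($_SERVER['REMOTE_ADDR'])?trim($_SERVER['REMOTE_ADDR']):'';
if ($address=='') $address='unspecified address';
$agent=isset($_SERVER['HTTP_USER_AGENT'])?trim($_SERVER['HTTP_USER_AGENT']):'';
if ($agent=='') $agent='unspecified agent';
$query=isset($_SERVER['QUERY_STRING'])?trim($_SERVER['QUERY_STRING']):'';
if ($query=='') $query='no query';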
$msg=$referrer.$crlf
.' '.$address.$crlf
.' '.$agent.$crlf
.' '.$query.$crlf
.' '.date('l, j F Y, H:i:s',time()-10800).$crlf
.$crlf;
// Log data:
$lhandle=@fopen($logfile,'a');
if ($lhandle!==FALSE)
{
fwrite($lhandle,$msg.$crlf);
fclose($lhandle);
}
?>
The essence of the trap is that spweb1.php (like makemail.php) must be disallowed in your robots.txt file. Any IP address it logs therefore belongs to a user-agent that ignored your robots.txt file.
(I have a virtually identical spweb2.php that is not linked anywhere: it is named only in the robots.txt file--so any user-agent caught by that trap actually harvests blocked files from robots.txt. There is no need to keep the two kinds of creeps segregated, but I like to be able to see which was which.)
The 302-redirect link is to a neat harvester-poisoning site page, which you can go to and inspect for yourself. The thieves you send there will love it . . . .
I put these forth as probably useful, but more as starting points so that others can gin up their own flavors. The essence, again, is to stop (or, at any rate, very seriously slow down) bots in the act, based purely on their actual behavior as seen in real time.