That's very useful, JD. I'm not sure we're using it on our server, but I've asked my tech to talk me through it. I can think of several sites we host that would really benefit from this.
You'd usually benefit from this if your visitors view more than one or two pages, or if you expect a lot of corporate users (who usually share a caching proxy). One thing to keep in mind: if you configure content to expire over an extended period (e.g. a day or longer), you will need to lower this time ahead of a deployment, so that all caches pick up the new content sooner. Do this at least one full expiry period in advance, because copies that are already cached keep the lifetime they were served with. J.D.
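The point about lowering the expiry time before a deployment can be sketched in a few lines. This is only an illustration under assumptions: the function name `cache_headers` and its parameters are made up for this example, and it presumes your application sets the `Cache-Control` and `Expires` headers itself rather than leaving that to the web server.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(normal_ttl, now, next_deploy=None):
    """Return caching headers, capping the lifetime at the next deployment.

    normal_ttl  -- timedelta used when no deployment is planned
    now         -- timezone-aware UTC datetime of the response
    next_deploy -- optional timezone-aware UTC datetime of the planned release
    """
    ttl = normal_ttl
    if next_deploy is not None and next_deploy > now:
        # Never let a cached copy outlive the planned release.
        ttl = min(normal_ttl, next_deploy - now)
    seconds = int(ttl.total_seconds())
    return {
        "Cache-Control": f"max-age={seconds}",
        "Expires": format_datetime(now + ttl, usegmt=True),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deploy = now + timedelta(hours=6)
print(cache_headers(timedelta(days=7), now))
print(cache_headers(timedelta(days=7), now, next_deploy=deploy))
```

With a seven-day lifetime and a release six hours away, the second call hands out `max-age=21600` instead of `max-age=604800`. The cap only helps for responses served after the release time is known, so it has to be switched on at least one normal lifetime before the deployment.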
Except you also want your pages to load fast for your viewers... Another thing that helps is to take out any comments and whitespace from your HTML. (It's not much of a saving, but it does add up.)
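For anyone who wants to try stripping comments and whitespace, here is a rough sketch. It is deliberately naive and the function name `strip_html` is just for illustration: it works on regular expressions rather than a real HTML parser, so it will mangle `<pre>`, `<textarea>` and inline scripts, and removing whitespace between inline elements can change how text renders. Test it on your own pages before relying on it.

```python
import re


def strip_html(html):
    """Naively remove comments and collapse whitespace in an HTML string."""
    # Drop <!-- ... --> comments, but keep IE conditional comments (<!--[if ...).
    html = re.sub(r"<!--(?!\[if).*?-->", "", html, flags=re.DOTALL)
    # Remove whitespace that sits between two tags.
    html = re.sub(r">\s+<", "><", html)
    # Collapse remaining runs of whitespace into a single space.
    html = re.sub(r"\s+", " ", html)
    return html.strip()


page = """
<html>
  <!-- navigation -->
  <body>
    <p>Hello,   world</p>
  </body>
</html>
"""
print(strip_html(page))
```

Comparing `len(page)` with `len(strip_html(page))` on a few real pages gives a quick idea of whether the saving is worth the effort, especially if the server already compresses responses.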
Optimize your HTML pages. Another thing I did: I put my HTML pages on my own paid hosting, then uploaded all my images/SWF files to a free hosting site that has unlimited transfer. Cheers!
Yesterday I put this meta tag in my HTML file: <META NAME="revisit-after" CONTENT="7 days"> but today when I looked at the logs, the MSN and Google robots were coming again. I don't know why...
If the bots are not major search engines and they take up a lot of bandwidth, send them away:

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^AWSM [OR]
RewriteCond %{HTTP_USER_AGENT} ^sna [OR]
RewriteCond %{HTTP_USER_AGENT} ^aipbot
# Answer every request from the user agents above with 403 Forbidden
RewriteRule .* - [F]

There are a lot more bots out there, but those are the ones that were attacking one of my sites.
The revisit-after Meta tag has been ignored by spiders since the 1990s. There are dozens of Meta tags and the majority are completely ignored by search engines. They are just worthless overhead.
Those are from his .htaccess file. BTW, here's a copy of mine. It came from WMW, myself and contributions from other webmasters. -jay

# These are my ban the nasty bot/harvester list - jw
#
# Banning BOTS below
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PlantyNet_WebRobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NutchCVS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^gamekitbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ichiro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^avuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bdfetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AIBOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^aibot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Libby [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^mysearch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OmniExplor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP/4.2.2 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SearchIndy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^rameda [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Huron [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^noxtrum [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper|NutchCVS?) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NaverRobot [NC]
RewriteRule .* - [F]
Here's a command/script that I use to quickly identify new bots / harvesters that don't mimic the IE / Mozilla / Firefox UAs. I can't remember where I originally found it, but it has been helpful to me, so I hope you get some use out of it too!

awk -F'"' '{print $6}' /var/log/httpd/access_log | sort | uniq | grep -v Mozilla
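If you would rather see how often each of those agents shows up, the same idea can be sketched in Python. This assumes the Apache combined log format, where the user agent is the sixth quote-delimited field (the same field the command above prints). The function name and the sample lines are invented for illustration.

```python
from collections import Counter


def non_browser_agents(lines):
    """Count requests per user agent, skipping agents that contain 'Mozilla'."""
    counts = Counter()
    for line in lines:
        fields = line.split('"')
        # In the combined log format the user agent is the 6th quoted field.
        if len(fields) >= 6 and "Mozilla" not in fields[5]:
            counts[fields[5]] += 1
    return counts


# Made-up sample lines; in practice iterate over open("/var/log/httpd/access_log").
sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 900 "-" "psbot/0.1"',
    '5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 200 700 "-" "psbot/0.1"',
    '9.9.9.9 - - [01/Jan/2024:00:00:04 +0000] "GET /c HTTP/1.1" 200 100 "-" "Wget/1.21"',
]
for agent, hits in non_browser_agents(sample).most_common():
    print(hits, agent)
```

Sorting by hit count puts the noisiest unknown agents at the top, which is usually where you want to start looking.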
Keep in mind that this list will be evaluated on every hit. It's a very long list and it will certainly affect your server's performance. You might want to check if the saved bandwidth is really worth it. For example, measure how much bandwidth these robots use per week and also check how much slower your server works with these conditions in place (you'll need a stress tool for this). J.D.
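To put a number on the bandwidth side of that trade-off, one option is to total the response bytes per user agent from the access log. The sketch below assumes the Apache combined log format. The regular expression, the function name `bytes_per_agent` and the sample lines are illustrative, and it does not measure the CPU cost of the rewrite conditions, which still needs a stress tool.

```python
import re
from collections import defaultdict

# host ident user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_per_agent(lines):
    """Sum the response sizes (in bytes) for each user agent string."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the combined format
        size = match.group("size")
        # Apache logs "-" when no body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return dict(totals)


# Made-up sample lines; in practice read one week of your access log.
log = [
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 900 "-" "psbot/0.1"',
    '5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 200 700 "-" "psbot/0.1"',
    '9.9.9.9 - - [01/Jan/2024:00:00:04 +0000] "GET /c HTTP/1.1" 304 - "-" "Wget/1.21"',
]
for agent, total in sorted(bytes_per_agent(log).items(), key=lambda item: -item[1]):
    print(total, agent)
```

Run over a week of logs, the totals for the agents you plan to block show how much transfer the list would actually save.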
JD, you are correct. I've tested it and in my case it adds about 1-3% to the processor utilization. I'm running on a dedicated machine, so it doesn't affect others, and to me it's worth the trade-off. I get deluged by email harvesters, scrapers and misbehaving bots, and this was the easiest / quickest way to block them. But I have been seeing a new breed of harvesters that rotate their UAs on each pageview; each UA is a legitimate string AND the harvester software is now smart enough to avoid my spider traps. One solution I'm working on is to force user authentication via a graphically rendered character string if they view more than x pages. Hopefully this will catch the rest of them without harming the user experience on the site. Also, if I see a lot of bad activity from a particular ISP/country, I use apf to block the entire CIDR range. If you have suggestions or other tactics for performance or security, I would appreciate hearing them. Thanks, -jay
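The "more than x pages" trigger could look roughly like the sketch below. It is only one possible reading of the idea: the class name, the sliding time window and the use of the client IP as the key are all assumptions (the post only mentions a page count), and the IP is used because the harvesters described rotate their user agent on every request.

```python
from collections import defaultdict, deque


class PageViewLimiter:
    """Flag a client once it exceeds max_views page views within window seconds."""

    def __init__(self, max_views, window):
        self.max_views = max_views
        self.window = window
        self._views = defaultdict(deque)  # client id -> timestamps of recent views

    def needs_challenge(self, client_id, now):
        """Record a page view at time `now` and report whether to challenge."""
        views = self._views[client_id]
        views.append(now)
        # Forget page views that fell out of the time window.
        while views and now - views[0] > self.window:
            views.popleft()
        return len(views) > self.max_views


limiter = PageViewLimiter(max_views=3, window=60)
results = [limiter.needs_challenge("203.0.113.7", t) for t in (0, 5, 10, 15, 120)]
print(results)
```

In this run the fourth view inside the 60-second window is the first one flagged, and the client is clear again once the earlier views have aged out. A real site would want to exempt known search engine crawlers before showing any challenge.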
It's not bad. I expected it to be higher, in the 5-10% range (regex can be pretty expensive). Legitimate robots are easy to control with robots.txt and don't need to be in the rewrite list. Some user agents (like the Java client) do not check for robots.txt and should be blocked at the rewrite level. Evil robots can only be blocked reliably at the IP level. I monitor traffic and if I see unreasonably high activity from a certain range, I usually block the source for some time (sometimes the entire block, depending on the usage pattern). The idea of presenting graphics is interesting, but may drive some of your visitors away. I wouldn't use it. J.D.
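Spotting "unreasonably high activity from a certain range" can be approximated by grouping client addresses into network blocks and counting requests per block. The sketch below is an assumption-laden illustration: the function name, the /24 grouping and the threshold are arbitrary choices, and the actual blocking (firewall rule, apf, etc.) is left out.

```python
import ipaddress
from collections import Counter


def busy_networks(ips, prefix=24, threshold=100):
    """Return {network: request count} for blocks with at least `threshold` hits."""
    counts = Counter()
    for ip in ips:
        # strict=False masks the host bits, e.g. 198.51.100.10/24 -> 198.51.100.0/24
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return {net: n for net, n in counts.items() if n >= threshold}


# Made-up client addresses; in practice take the first field of each log line.
hits = ["198.51.100.10"] * 3 + ["198.51.100.77"] * 2 + ["203.0.113.5"]
print(busy_networks(hits, prefix=24, threshold=4))
```

Here the two addresses in 198.51.100.0/24 add up to five requests and cross the threshold together, even though neither would on its own. That is the case where blocking a whole range rather than a single IP starts to make sense.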
Saving bandwidth is a matter of compressing the images, which are the biggest factor in bandwidth usage. We have a unique tool that can do the job.