Block all the bad bots from your site!!!


ketan9
Oct 31st 2007, 9:27 am
If you would like to block unwanted user agents that scrape your site, or any bots that you don't want accessing it, add the following statements to your httpd.conf file. Below is the big list of user agents that I deny access to; you may want to edit the list depending on your requirements.


# The following part goes outside all VirtualHost or Directory sections in httpd.conf
SetEnvIfNoCase User-Agent "^Alexibot*" bad_bot
SetEnvIfNoCase User-Agent "^Anarchie*" bad_bot
SetEnvIfNoCase User-Agent "^Aqua_Products*" bad_bot
SetEnvIfNoCase User-Agent "^asterias*" bad_bot
SetEnvIfNoCase User-Agent "^autoemailspider*" bad_bot
SetEnvIfNoCase User-Agent "^b2w*" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot*" bad_bot
SetEnvIfNoCase User-Agent "^Black.Hole*" bad_bot
SetEnvIfNoCase User-Agent "^BlackWidow*" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish*" bad_bot
SetEnvIfNoCase User-Agent "^Bookmark search tool*" bad_bot
SetEnvIfNoCase User-Agent "^BotALot*" bad_bot
SetEnvIfNoCase User-Agent "^BotRightHere*" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough*" bad_bot
SetEnvIfNoCase User-Agent "^Bullseye*" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers*" bad_bot
SetEnvIfNoCase User-Agent "^Bloodhound*" bad_bot
SetEnvIfNoCase User-Agent "^Bumblebee*" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker*" bad_bot
SetEnvIfNoCase User-Agent "^Crescent*" bad_bot
SetEnvIfNoCase User-Agent "^CIS TE*" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh*" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot*" bad_bot
SetEnvIfNoCase User-Agent "^ChinaClaw*" bad_bot
SetEnvIfNoCase User-Agent "^Copernic*" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck*" bad_bot
SetEnvIfNoCase User-Agent "^Cosmos*" bad_bot
SetEnvIfNoCase User-Agent "^Custo*" bad_bot
SetEnvIfNoCase User-Agent "^DISCo*" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder*" bad_bot
SetEnvIfNoCase User-Agent "^Download*" bad_bot
SetEnvIfNoCase User-Agent "^Dart*" bad_bot
SetEnvIfNoCase User-Agent "^DA" bad_bot
SetEnvIfNoCase User-Agent "^DIIbot*" bad_bot
SetEnvIfNoCase User-Agent "^DiscoPump*" bad_bot
SetEnvIfNoCase User-Agent "^Download Ninja*" bad_bot
SetEnvIfNoCase User-Agent "^Drip*" bad_bot
SetEnvIfNoCase User-Agent "^EirGrabber*" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler*" bad_bot
SetEnvIfNoCase User-Agent "^Express*" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro*" bad_bot
SetEnvIfNoCase User-Agent "^EyeNetIE*" bad_bot
SetEnvIfNoCase User-Agent "^eCatch*" bad_bot
SetEnvIfNoCase User-Agent "^e-collector*" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector*" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^FairAd Client*" bad_bot
SetEnvIfNoCase User-Agent "^Flaming AttackBot*" bad_bot
SetEnvIfNoCase User-Agent "^FlashGet*" bad_bot
SetEnvIfNoCase User-Agent "^Foobot*" bad_bot
SetEnvIfNoCase User-Agent "^FrontPage*" bad_bot
SetEnvIfNoCase User-Agent "^FAST -WebCrawler*" bad_bot
SetEnvIfNoCase User-Agent "^fastlwspider*" bad_bot
SetEnvIfNoCase User-Agent "^FunWeb*" bad_bot
SetEnvIfNoCase User-Agent "^Gaisbot*" bad_bot
SetEnvIfNoCase User-Agent "^GetRight*" bad_bot
SetEnvIfNoCase User-Agent "^GetWeb*" bad_bot
SetEnvIfNoCase User-Agent "^Go-Ahead-Got-It*" bad_bot
SetEnvIfNoCase User-Agent "^GrabNet*" bad_bot
SetEnvIfNoCase User-Agent "^Grafula*" bad_bot
SetEnvIfNoCase User-Agent "^Getleft*" bad_bot
SetEnvIfNoCase User-Agent "^Gets*" bad_bot
SetEnvIfNoCase User-Agent "^GetWebPage*" bad_bot
SetEnvIfNoCase User-Agent "^GetYou*" bad_bot
SetEnvIfNoCase User-Agent "^Gozilla*" bad_bot
SetEnvIfNoCase User-Agent "^Go!Zilla*" bad_bot
SetEnvIfNoCase User-Agent "^Harvest*" bad_bot
SetEnvIfNoCase User-Agent "^hloader*" bad_bot
SetEnvIfNoCase User-Agent "^HMView*" bad_bot
SetEnvIfNoCase User-Agent "^httplib*" bad_bot
SetEnvIfNoCase User-Agent "^HTTrack*" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks*" bad_bot
SetEnvIfNoCase User-Agent "^ia_archiver*" bad_bot
SetEnvIfNoCase User-Agent "^IBrowse*" bad_bot
SetEnvIfNoCase User-Agent "^ImageGrab*" bad_bot
SetEnvIfNoCase User-Agent "^InterGET*" bad_bot
SetEnvIfNoCase User-Agent "^Internet Ninja*" bad_bot
SetEnvIfNoCase User-Agent "^Iria*" bad_bot
SetEnvIfNoCase User-Agent "^Image*" bad_bot
SetEnvIfNoCase User-Agent "^Indy*" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot*" bad_bot
SetEnvIfNoCase User-Agent "^Internet*" bad_bot
SetEnvIfNoCase User-Agent "^Iron33*" bad_bot
SetEnvIfNoCase User-Agent "^Java*" bad_bot
SetEnvIfNoCase User-Agent "^JBH Agent*" bad_bot
SetEnvIfNoCase User-Agent "^JetCar*" bad_bot
SetEnvIfNoCase User-Agent "^JustView*" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot*" bad_bot
SetEnvIfNoCase User-Agent "^JOC*" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin.Spider*" bad_bot
SetEnvIfNoCase User-Agent "^Keyword Density*" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial*" bad_bot
SetEnvIfNoCase User-Agent "^LeechFTP*" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker*" bad_bot
SetEnvIfNoCase User-Agent "^larbin*" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot*" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP*" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro*" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan*" bad_bot
SetEnvIfNoCase User-Agent "^LNSpiderguy*" bad_bot
SetEnvIfNoCase User-Agent "^Mass*" bad_bot
SetEnvIfNoCase User-Agent "^Mata Hari*" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft URL Control*" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft.URL*" bad_bot
SetEnvIfNoCase User-Agent "^MIDown*" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc*" bad_bot
SetEnvIfNoCase User-Agent "^Mister*" bad_bot
SetEnvIfNoCase User-Agent "^Mister.PiX*" bad_bot
SetEnvIfNoCase User-Agent "^moget*" bad_bot
SetEnvIfNoCase User-Agent "^NEWT*" bad_bot
SetEnvIfNoCase User-Agent "^MS FrontPage*" bad_bot
SetEnvIfNoCase User-Agent "^MSIECrawler*" bad_bot
SetEnvIfNoCase User-Agent "^MSProxy" bad_bot
SetEnvIfNoCase User-Agent "^Mass Down*" bad_bot
SetEnvIfNoCase User-Agent "^MD download*" bad_bot
SetEnvIfNoCase User-Agent "^MemoWeb*" bad_bot
SetEnvIfNoCase User-Agent "^MetaProducts*" bad_bot
SetEnvIfNoCase User-Agent "^MFC_Tear*" bad_bot
SetEnvIfNoCase User-Agent "^MIDown tool*" bad_bot
SetEnvIfNoCase User-Agent "^minibot*" bad_bot
SetEnvIfNoCase User-Agent "^MyGetRight*" bad_bot
SetEnvIfNoCase User-Agent "^MyWay*" bad_bot
SetEnvIfNoCase User-Agent "^Navroad*" bad_bot
SetEnvIfNoCase User-Agent "^NearSite*" bad_bot
SetEnvIfNoCase User-Agent "^Net" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts*" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic*" bad_bot
SetEnvIfNoCase User-Agent "^NetSpider*" bad_bot
SetEnvIfNoCase User-Agent "^NetZIP*" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO*" bad_bot
SetEnvIfNoCase User-Agent "^NPBot*" bad_bot
SetEnvIfNoCase User-Agent "^obot*" bad_bot
SetEnvIfNoCase User-Agent "^Offline*" bad_bot
SetEnvIfNoCase User-Agent "^Octopus*" bad_bot
SetEnvIfNoCase User-Agent "^Openbot*" bad_bot
SetEnvIfNoCase User-Agent "^Openfind*" bad_bot
SetEnvIfNoCase User-Agent "^Oracle Ultra Search*" bad_bot
SetEnvIfNoCase User-Agent "^PageGrabber*" bad_bot
SetEnvIfNoCase User-Agent "^Pockey*" bad_bot
SetEnvIfNoCase User-Agent "^Prozilla*" bad_bot
SetEnvIfNoCase User-Agent "^Papa*" bad_bot
SetEnvIfNoCase User-Agent "^pavuk*" bad_bot
SetEnvIfNoCase User-Agent "^pcBrowser*" bad_bot
SetEnvIfNoCase User-Agent "^PerMan*" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot*" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker*" bad_bot
SetEnvIfNoCase User-Agent "^psbot*" bad_bot
SetEnvIfNoCase User-Agent "^Python-urllib*" bad_bot
SetEnvIfNoCase User-Agent "^QRVA*" bad_bot
SetEnvIfNoCase User-Agent "^QueryN.Metasearch*" bad_bot
SetEnvIfNoCase User-Agent "^Radiation Retriever*" bad_bot
SetEnvIfNoCase User-Agent "^ReGet*" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey*" bad_bot
SetEnvIfNoCase User-Agent "^RMA*" bad_bot
SetEnvIfNoCase User-Agent "^RealDownload*" bad_bot
SetEnvIfNoCase User-Agent "^Reaper*" bad_bot
SetEnvIfNoCase User-Agent "^Recorder*" bad_bot
SetEnvIfNoCase User-Agent "^searchpreview*" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger*" bad_bot
SetEnvIfNoCase User-Agent "^SlySearch*" bad_bot
SetEnvIfNoCase User-Agent "^SmartDownload*" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot*" bad_bot
SetEnvIfNoCase User-Agent "^spanner*" bad_bot
SetEnvIfNoCase User-Agent "^SuperBot*" bad_bot
SetEnvIfNoCase User-Agent "^SuperHTTP*" bad_bot
SetEnvIfNoCase User-Agent "^Surfbot*" bad_bot
SetEnvIfNoCase User-Agent "^suzuran*" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz*" bad_bot
SetEnvIfNoCase User-Agent "^Scooter*" bad_bot
SetEnvIfNoCase User-Agent "^Slurp*" bad_bot
SetEnvIfNoCase User-Agent "^SpaceBison*" bad_bot
SetEnvIfNoCase User-Agent "^Star Downloader*" bad_bot
SetEnvIfNoCase User-Agent "^Stripper*" bad_bot
SetEnvIfNoCase User-Agent "^Sucker*" bad_bot
SetEnvIfNoCase User-Agent "^SurfWalker*" bad_bot
SetEnvIfNoCase User-Agent "^tAkeOut*" bad_bot
SetEnvIfNoCase User-Agent "^Teleport*" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft*" bad_bot
SetEnvIfNoCase User-Agent "^Turnitin*" bad_bot
SetEnvIfNoCase User-Agent "^The Intraformant*" bad_bot
SetEnvIfNoCase User-Agent "^The.Intraformant*" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad*" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot*" bad_bot
SetEnvIfNoCase User-Agent "^Titan*" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher*" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot*" bad_bot
SetEnvIfNoCase User-Agent "^turingos*" bad_bot
SetEnvIfNoCase User-Agent "^URL Control*" bad_bot
SetEnvIfNoCase User-Agent "^URL_Spider*" bad_bot
SetEnvIfNoCase User-Agent "^URLy.Warning*" bad_bot
SetEnvIfNoCase User-Agent "^VCI*" bad_bot
SetEnvIfNoCase User-Agent "^VoidEYE*" bad_bot
SetEnvIfNoCase User-Agent "^Vacuum*" bad_bot
SetEnvIfNoCase User-Agent "^vobsub*" bad_bot
SetEnvIfNoCase User-Agent "^w3mir*" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto*" bad_bot
SetEnvIfNoCase User-Agent "^[Ww]eb[Bb]andit*" bad_bot
SetEnvIfNoCase User-Agent "^WebCapture*" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier*" bad_bot
SetEnvIfNoCase User-Agent "^Web Downloader*" bad_bot
SetEnvIfNoCase User-Agent "^WebDownloader*" bad_bot
SetEnvIfNoCase User-Agent "^Webdupe*" bad_bot
SetEnvIfNoCase User-Agent "^WebEMailExtrac*" bad_bot
SetEnvIfNoCase User-Agent "^Web[Ff]etch*" bad_bot
SetEnvIfNoCase User-Agent "^WebFountain*" bad_bot
SetEnvIfNoCase User-Agent "^WebHook*" bad_bot
SetEnvIfNoCase User-Agent "^Web Image*" bad_bot
SetEnvIfNoCase User-Agent "^WebImageCollector*" bad_bot
SetEnvIfNoCase User-Agent "^WebEMailExtractor*" bad_bot
SetEnvIfNoCase User-Agent "^WebMiner*" bad_bot
SetEnvIfNoCase User-Agent "^WebMirror*" bad_bot
SetEnvIfNoCase User-Agent "^WebReaper*" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger*" bad_bot
SetEnvIfNoCase User-Agent "^Website" bad_bot
SetEnvIfNoCase User-Agent "^Website eXtractor*" bad_bot
SetEnvIfNoCase User-Agent "^Webster*" bad_bot
SetEnvIfNoCase User-Agent "^WebStripper*" bad_bot
SetEnvIfNoCase User-Agent "^Web Sucker*" bad_bot
SetEnvIfNoCase User-Agent "^WebSucker*" bad_bot
SetEnvIfNoCase User-Agent "^WebWhacker*" bad_bot
SetEnvIfNoCase User-Agent "^WebZ[Ii][Pp]*" bad_bot
SetEnvIfNoCase User-Agent "^Wget*" bad_bot
SetEnvIfNoCase User-Agent "^WhizBang*" bad_bot
SetEnvIfNoCase User-Agent "^Widow*" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E*" bad_bot
SetEnvIfNoCase User-Agent "^WWWOFFLE*" bad_bot
SetEnvIfNoCase User-Agent "^Web*" bad_bot
SetEnvIfNoCase User-Agent "^Web.Image.Collector*" bad_bot
SetEnvIfNoCase User-Agent "^WebEMailExtrac.*" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer*" bad_bot
SetEnvIfNoCase User-Agent "^WebGo*" bad_bot
SetEnvIfNoCase User-Agent "^WebLeacher*" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot*" bad_bot
SetEnvIfNoCase User-Agent "^Website.Quester*" bad_bot
SetEnvIfNoCase User-Agent "^Xaldon*" bad_bot
SetEnvIfNoCase User-Agent "^Xenu*" bad_bot
SetEnvIfNoCase User-Agent "^Zeus*" bad_bot

# The following goes in your VirtualHost declarations. You will have to repeat it for every VirtualHost you have!
<Location "/">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Location>


Hope this helps.
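
The patterns in the list above are regular expressions, not shell-style wildcards, so it can be worth checking what a pattern really matches before deploying it. Below is a minimal sketch in Python, assuming Python's `re` module behaves like Apache's regex engine for patterns this simple. The `matches` helper and the sample user-agent strings are illustrative only and are not part of the original post.

```python
import re


def matches(pattern: str, user_agent: str) -> bool:
    """Approximate SetEnvIfNoCase: case-insensitive, unanchored regex search."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


# A trailing "*" makes the last character optional instead of acting as a wildcard:
# "^DA*" is "D followed by zero or more A", so any agent starting with "D" matches.
print(matches("^DA*", "Dillo/0.8.6"))  # True
print(matches("^DA", "Dillo/0.8.6"))   # False
print(matches("^DA", "DA 5.0"))        # True

# "^" anchors the pattern to the start of the header, so a name that appears
# later in the string is missed.
embedded = "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
print(matches("^HTTrack", embedded))   # False
print(matches("HTTrack", embedded))    # True
```

Because the match is unanchored at the end, a pattern such as "^Wget" already matches every user agent that starts with "Wget", with or without a trailing "*".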

KalvinB
Oct 31st 2007, 11:44 am
If I'm going to use a bot to scrape a site, I set the user agent to a valid IE user agent. These types of lists don't stop much of anything and just add extra processing time on your web server.

If someone is scraping your site, you're better off blocking their IP at the network level. I used to do that with my Windows 2000 server. I just routed IPs to never-never land using Windows' built-in functions for that sort of thing. It's a lot more efficient than making Apache do it. Linux can also block IPs at the network level.

You could even have your site keep track of which IPs are downloading what and auto-block an IP if you felt so inclined.
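
One way to keep track of which IPs are downloading what is to total the bytes served per IP from the access log. The following is a rough sketch, assuming the log uses the Common Log Format; the function names, sample lines, and the 1,000,000-byte limit are made up for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')


def bytes_per_ip(lines):
    """Sum the response sizes per client IP; unparseable lines are skipped."""
    totals = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue
        ip, size = m.groups()
        if size != "-":  # "-" means no body was sent
            totals[ip] += int(size)
    return totals


def over_limit(totals, limit_bytes):
    """Return the IPs whose total exceeds limit_bytes, sorted for stable output."""
    return sorted(ip for ip, total in totals.items() if total > limit_bytes)


sample = [
    '192.0.2.10 - - [31/Oct/2007:09:27:00 +0000] "GET /a.zip HTTP/1.1" 200 700000',
    '192.0.2.10 - - [31/Oct/2007:09:28:00 +0000] "GET /b.zip HTTP/1.1" 200 600000',
    '198.51.100.7 - - [31/Oct/2007:09:29:00 +0000] "GET / HTTP/1.1" 200 5120',
    '198.51.100.7 - - [31/Oct/2007:09:30:00 +0000] "GET /x HTTP/1.1" 304 -',
]
totals = bytes_per_ip(sample)
print(over_limit(totals, 1_000_000))  # ['192.0.2.10']
```

The list it returns could then be fed to whatever network-level block you prefer.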

markn26
Oct 31st 2007, 11:46 am
I think this can be done in robots.txt as well. I saw a robots.txt generator that did something along these lines.
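
For reference, here is a small sketch of what the robots.txt approach might look like, checked with Python's standard `urllib.robotparser`. The two blocked agent names are taken from the list in the first post; the rest of the file is an assumption. Keep in mind that robots.txt is only advisory: a bot has to choose to read and obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out two named agents, allow everyone else.
ROBOTS_TXT = """\
User-agent: HTTrack
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved HTTrack would see that it is not welcome anywhere on the site.
print(parser.can_fetch("HTTrack", "/index.html"))    # False
# Agents without their own section fall through to the "*" rule, which allows all.
print(parser.can_fetch("Googlebot", "/index.html"))  # True
```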

ketan9
Oct 31st 2007, 2:01 pm
I agree with you, although finding the IPs and blocking them manually takes a lot of time and effort. I am looking for a way to do it automatically: if someone consumes too much bandwidth, stop them from using the site altogether. I couldn't find a good way to do that. Let me know if you have a better approach.

KalvinB
Nov 1st 2007, 10:31 am
I believe Apache 2 has bandwidth throttling.

I also design my sites so that everything (except images and JS) goes through index.php, even downloads. So if I feel the need, I can log per-IP usage and issue the Windows command to reroute an IP automatically if it uses more bandwidth per day than allowed.
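
The per-IP daily cap described above could look roughly like the sketch below. It is written in Python rather than PHP purely for illustration, and `UsageTracker`, `block_command`, and the 1,000,000-byte limit are hypothetical names and values, not anything from an actual site. The block command shown is the Linux `iptables` form; the Windows routing command mentioned in the post is not reproduced here, and the sketch only prints the command instead of running it.

```python
from collections import defaultdict
from datetime import date


class UsageTracker:
    """Per-IP, per-day byte counter with a daily cap (hypothetical helper)."""

    def __init__(self, daily_limit_bytes):
        self.daily_limit_bytes = daily_limit_bytes
        self._usage = defaultdict(int)  # (day, ip) -> bytes served that day
        self.blocked = set()

    def record(self, ip, nbytes, day):
        """Add nbytes to the IP's total for that day; return True if it is blocked."""
        self._usage[(day, ip)] += nbytes
        if self._usage[(day, ip)] > self.daily_limit_bytes:
            self.blocked.add(ip)
        return ip in self.blocked


def block_command(ip):
    """Build (but do not run) a Linux firewall rule that drops traffic from ip."""
    return ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]


tracker = UsageTracker(daily_limit_bytes=1_000_000)
today = date(2007, 11, 1)
tracker.record("192.0.2.10", 600_000, today)          # still under the cap
if tracker.record("192.0.2.10", 500_000, today):       # 1,100,000 bytes: over
    print(" ".join(block_command("192.0.2.10")))
# prints: iptables -A INPUT -s 192.0.2.10 -j DROP
```

In a front-controller setup, `record` would be called once per response with the number of bytes sent, and the counters would need to live in a database or file rather than in memory.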