Filter out Bots in .htaccess

Discussion in 'Apache' started by jaguar34, Aug 24, 2015.

  1. #1
    My site gets lots of bots visiting it, and I want to know how to block them and filter them out with .htaccess.
     
    jaguar34, Aug 24, 2015
  2. Dayvi

    Dayvi Member

    #2
    This may help: https://perishablepress.com/2014-micro-blacklist/

    Customize it to suit your needs.
     
    Dayvi, Aug 24, 2015
  3. sarahk

    sarahk iTamer Staff

    #3
    Before you go and do a stack of work blocking the bots, have a really good think about why you need to block them.

    About 10 years ago I had a project to monitor a number of websites, identify the bots, research them and publish my findings.
    I thought it would be simple but was quickly overwhelmed by the number of bots out there - I'm talking hundreds and hundreds.

    If you really want to take that on, then go for it, but honestly, none are doing any harm, just let them run.
     
    sarahk, Aug 24, 2015
  4. qwikad.com

    qwikad.com Illustrious Member Affiliate Manager

    #4
    I'd suggest blocking as many bad bots as possible. Place this into your .htaccess file. The list isn't new, so I am sure there are numerous new bad bots you can add to it. Just Google them:

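    ## Bad Bots ##
    # Send 403 Forbidden to any request whose User-Agent matches a pattern below.
    # ^ anchors a pattern to the start of the string, [NC] ignores case,
    # and [OR] chains each condition to the next one.

    RewriteEngine On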
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo\.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Twiceler [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sogou\ web\ spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^spbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DotBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Jakarta\ Commons [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^bixolabs [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GeoHasher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Yeti [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mail\.Ru [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LMQueueBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoilaBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ScrapeBox [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Huaweisymantecspider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Nutch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^AuditMyPC [OR]
    RewriteCond %{HTTP_USER_AGENT} ^xml-sitemaps [OR]
    RewriteCond %{HTTP_USER_AGENT} ^IUPUI\ Research\ Bot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
    RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Gigabot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BotRightHere [OR]
    RewriteCond %{HTTP_USER_AGENT} ^b2w/0\.1 [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Copernic [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
    RewriteCond %{HTTP_USER_AGENT} ^URL_Spider_Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
    RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LNSpiderguy [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
    RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWW-Collector-E [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
    RewriteCond %{HTTP_USER_AGENT} ^libWeb/clsHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
    RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
    RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
    RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
    RewriteCond %{HTTP_USER_AGENT} ^InfoNaviRobot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Harvest [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Foobot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
    RewriteCond %{HTTP_USER_AGENT} ^URLy\ Warning [OR]
    RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
    RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^humanlinks [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkextractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mata\ Hari [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^The\ Intraformant [OR]
    RewriteCond %{HTTP_USER_AGENT} ^True_Robot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^toCrawl/UrlDispatcher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
    RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Szukacz [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Openfind\ data\ gatherer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Openbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^URL\ Control [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus\ Link\ Scout [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Webster\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EroCrawler [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Keyword\ Density [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Kenjin\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Iron33 [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bookmark\ search\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FairAd\ Client [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Gaisbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Aqua_Products [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Radiation\ Retriever [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Flaming\ AttackBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Curl [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Reaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebVulnCrawl [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebVulnScan [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Black\ Hole [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Cegbfeieh [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LinkScan [OR]
    RewriteCond %{HTTP_USER_AGENT} ^QueryN\ Metasearch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xenu's [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Java [OR]
    RewriteCond %{HTTP_USER_AGENT} ^User-Agent [OR]
    RewriteCond %{HTTP_USER_AGENT} ^panscient\.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    # "-" leaves the URL untouched; [F] returns 403 and already implies [L].
    RewriteRule ^ - [F]
    
    Code (markup):
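For anyone customizing the list, here is a rough Python sketch of how a few of these conditions behave. It is only an illustration of the matching semantics (patterns are case-sensitive unless [NC] is set, and `^` ties the match to the very start of the string), not a model of Apache itself. The sample User-Agent strings are assumptions written in the formats these clients commonly send.

```python
import re

# A handful of patterns copied from the rule list above. Like RewriteCond,
# re.search() is unanchored unless the pattern starts with ^, and it is
# case-sensitive unless a flag (re.IGNORECASE here, standing in for [NC]) is set.
RULES = [
    (r"^Wget", 0),
    (r"^Curl", 0),
    (r"^YandexBot", 0),
    (r"HTTrack", re.IGNORECASE),
]
COMPILED = [re.compile(pattern, flags) for pattern, flags in RULES]


def is_blocked(user_agent: str) -> bool:
    """Return True if any rule matches, i.e. the request would get a 403."""
    return any(rx.search(user_agent) for rx in COMPILED)


if __name__ == "__main__":
    # Assumed sample strings, not taken from real logs.
    samples = [
        "Wget/1.16 (linux-gnu)",
        "curl/7.43.0",
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

With these samples only the Wget and HTTrack strings are caught. `^Curl` misses a lowercase `curl/...` string, and `^YandexBot` misses a string that begins with `Mozilla/5.0`, so it is worth checking each pattern against the User-Agent values that actually appear in your access log.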
     
    qwikad.com, Aug 28, 2015
  5. deathshadow

    deathshadow Acclaimed Member

    #5
    Just be warned, using a massive .htaccess or even doing it in httpd.conf is going to drag down the performance of EVERY file request into the deepest circles of hell. It's NOT a particularly efficient way of dealing with it, PARTICULARLY when you resort to a regex against the user-agent string.

    This is the type of thing that is better handled via IP lookups and caching, and by doing it at the firewall level -- assuming Linux or some other *nix host, that means using iptables.

    The nice thing about blacklisting at the packet level instead of the Apache level is that it also prevents them from probing ports for other services like SSH, FTP, SMTP, POP3, IMAP, etc.
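As a minimal sketch of the firewall approach, the Python below turns a list of addresses or CIDR blocks into `iptables` commands that drop everything from those sources. The helper name `drop_rules` and the addresses are made up for illustration (the addresses come from ranges reserved for documentation); it only prints the commands and does not touch the firewall.

```python
import ipaddress


def drop_rules(sources):
    """Yield one firewall command per valid source address or CIDR block."""
    for src in sources:
        # ip_network() validates the input; strict=False accepts host bits
        # being set (e.g. 198.51.100.7/24) and normalises to the network.
        net = ipaddress.ip_network(src, strict=False)
        # iptables handles IPv4 only; IPv6 rules go through ip6tables.
        tool = "iptables" if net.version == 4 else "ip6tables"
        yield f"{tool} -A INPUT -s {net} -j DROP"


if __name__ == "__main__":
    # Documentation-only example addresses, not real offenders.
    for cmd in drop_rules(["203.0.113.45", "198.51.100.7/24", "2001:db8::/32"]):
        print(cmd)
```

Note that `-A` appends to the end of the INPUT chain, so an earlier ACCEPT rule can still let the traffic through; `-I INPUT` inserts at the top instead. Which one fits depends on the existing rule set.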

    It's also flawed to try and use the UA; a lot of the nastier bots don't even report a UA string, or use a fake one to report themselves as something else. Faking the UA on a request isn't exactly rocket science.
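To show how little effort faking a UA takes, here is a short sketch using Python's standard `urllib`. The URL and the browser-like string are placeholders chosen for illustration, and the request is only built, not sent.

```python
import urllib.request

# Hypothetical target; example.com is reserved for documentation.
URL = "http://example.com/"

# A browser-like string picked for illustration; any text works here.
FAKE_UA = "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"

# Build (but do not send) a request that claims to be a desktop browser.
request = urllib.request.Request(URL, headers={"User-Agent": FAKE_UA})

if __name__ == "__main__":
    # urllib stores header names capitalised as "User-agent".
    print(request.get_header("User-agent"))
    # urllib.request.urlopen(request) would send it; it is left out so the
    # sketch runs without network access.
```

A request built this way would pass straight through every `RewriteCond %{HTTP_USER_AGENT}` line in a blocklist, since none of them match a normal browser string.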
     
    deathshadow, Aug 30, 2015
    jaguar34 and sarahk like this.
  6. deathshadow

    deathshadow Acclaimed Member

    #6
    Oh, and since UA strings can be EASILY faked (just like every other HTTP header), AND IP addresses are less and less useful as proxy networks like Tor become more and more common, blocking robots in general is becoming more and more difficult.

    I've been playing with the idea of just flat-out blocking any IP whose RDNS reports it as a Tor exit node, but blocking people paranoid about security is... not something I like the idea of.

    It's sad that something meant for personal security has become the number one tool crackers abuse for malware, spambots, and even masking brute-force password attacks.
     
    deathshadow, Aug 30, 2015
  7. qwikad.com

    qwikad.com Illustrious Member Affiliate Manager

    #7
    I mostly protect my sites from email harvesting programs. The point is it's good to block baaaaaad bots. Of course there's no need to block all bots (it's not possible), just the malicious ones like email harvesters, site rippers, theft bots of every kind. All they do is waste CPU resources.
     
    qwikad.com, Aug 30, 2015
  8. deathshadow

    deathshadow Acclaimed Member

    #8
    True enough -- the problem is that going overboard trying to block can often be more damaging to performance than just letting them through -- CRAZY as that sounds. Again though that hinges on traffic numbers and what server tech is in use.

    One thing I often do is reverse DNS lookups, and then look up the address block to see who it's assigned to. A good rule of thumb is that if bad or suspicious requests are coming from an address block assigned to a hosting company, just block that ENTIRE hosting company's address range from HTTP, FTP and POP3/IMAP access (but still allow SMTP through or you won't get e-mails from legitimate users at those hosts). Reason being, there's no legitimate reason for a generic hosting data center in Ukraine to be making HTTP requests of an English-language site.
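A rough sketch of those two checks in Python, assuming the blocked ranges are kept in a simple list. The function names and the ranges are hypothetical (the ranges are documentation-only networks), and the step of finding out who an address block is assigned to (a WHOIS query) is not covered here.

```python
import ipaddress
import socket

# Hypothetical blocked ranges, using networks reserved for documentation.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def reverse_dns(ip: str):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def in_blocked_range(ip: str) -> bool:
    """True if ip falls inside any range on the block list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ["203.0.113.77", "192.0.2.1"]:
        print(ip, in_blocked_range(ip))
```

The range check is cheap and local. `reverse_dns()` goes out to the resolver, so in practice its results would want caching rather than a fresh lookup per request.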

    At one point I had a forum I was managing where I ended up having to IP-level block half of Ukraine, two-thirds of China, most of the west coast and interior of Africa, and much of India and Pakistan because of the endless hordes of requests from those regions. The laugh was that it was a minor gaming website for a pen and paper desktop strategy game -- something I'd not have thought would be such a major target.

    Though at the same time the forums on said site were pushing a thousand posts a day (still managing ~600 or so a day under the new management), so I guess that might make it a pretty good place to try and spam / hijack.

    Scary part was that a LOT of that bad traffic was brute-force attempts to log in via SSH... Fail2Ban to the rescue on that one (which is now part of my "to-do" list when setting up a new server).
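The general idea behind that kind of tool can be sketched in a few lines of Python: count failed logins per IP inside a sliding time window and ban once a threshold is reached. This is an illustration of the concept only, not Fail2Ban's code; the class name is made up, and the threshold and window values are assumed (Fail2Ban exposes comparable `maxretry` and `findtime` settings).

```python
from collections import defaultdict, deque

MAX_RETRY = 5    # failures allowed inside the window (assumed value)
FIND_TIME = 600  # window length in seconds (assumed value)


class FailureTracker:
    """Track failed logins per IP and ban on too many inside the window."""

    def __init__(self, max_retry=MAX_RETRY, find_time=FIND_TIME):
        self.max_retry = max_retry
        self.find_time = find_time
        self.failures = defaultdict(deque)  # ip -> timestamps of failures
        self.banned = set()

    def record_failure(self, ip, now):
        """Record one failure at time `now`; return True if ip is banned."""
        window = self.failures[ip]
        window.append(now)
        # Drop failures that have aged out of the window.
        while window and now - window[0] > self.find_time:
            window.popleft()
        if len(window) >= self.max_retry:
            self.banned.add(ip)
        return ip in self.banned


if __name__ == "__main__":
    tracker = FailureTracker(max_retry=3, find_time=60)
    for second in (0, 10, 20):
        print(second, tracker.record_failure("203.0.113.9", second))
```

In a real setup the ban step would add a firewall rule and expire after a while; here it is just a set, to keep the sketch self-contained.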

    If anything it's a balancing act. You have to weigh the penalties of blocking vs. the load of not blocking. Sometimes you are better off mixing techniques too. If you want to block an IP address, use iptables; if you have something that is obviously and consistently URI or UA related, use a RewriteCond.

    In some ways I liken it to door locks on a car. We ALL know how easy it is to bypass -- any thug with a brick or crim with a slim-jim can bypass it in seconds -- but we still put them on cars and use them religiously... WHY?!? It keeps the honest people out...

    Random hashes in a type="hidden" field in contact forms, corresponding to a value stored in the session server-side, are a stunning example of that. They make SO many "off the shelf" spambots fall flat on their face just because they try to re-use the same form values ad nauseam... if they took the time to request a new form each time they could bypass it... but they usually don't bother and move on to some other less secure site instead. Same way a car thief will look for unlocked cars BEFORE trying to break into one.
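A minimal sketch of the hidden-field idea in Python, assuming the session is a plain dict; the function names and the `form_token` key are made up for illustration. A real site would use whatever session store its framework provides.

```python
import hmac
import secrets


def issue_form_token(session: dict) -> str:
    """Create a random token, remember it in the session, return it for the form."""
    token = secrets.token_hex(16)
    session["form_token"] = token
    return token  # goes into <input type="hidden" name="token" value="...">


def check_form_token(session: dict, submitted: str) -> bool:
    """Accept a submission only if it carries the token issued for this session."""
    # pop() makes the token single-use, so replaying old form values fails.
    expected = session.pop("form_token", None)
    if expected is None:
        return False
    # Compare as bytes in constant time.
    return hmac.compare_digest(expected.encode(), submitted.encode())


if __name__ == "__main__":
    session = {}
    token = issue_form_token(session)
    print(check_form_token(session, token))  # first submission
    print(check_form_token(session, token))  # same values replayed
```

The first submission is accepted and the replay is rejected, which is the behaviour that trips up bots re-posting the same form values.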
     
    deathshadow, Aug 30, 2015
  9. sarahk

    sarahk iTamer Staff

    #9
    I beg to differ. You'll waste so much time trying to identify if a new bot is likely to be good or bad that it isn't worth the hassle. Focus on the positive aspects of business.
     
    sarahk, Aug 30, 2015
  10. deathshadow

    deathshadow Acclaimed Member

    #10
    I wouldn't go that far. There are some nasties out there you DO need to get out there and block. If you've ever been the target of a DDoS, you'd know what I mean.

    The wolves are out there. Always. Worse, they're usually dressed as sheep.
     
    deathshadow, Aug 30, 2015
    jaguar34 likes this.