Evil bots consuming all my sites' traffic even after I disallowed them in robots.txt

Discussion in 'Site & Server Administration' started by Fking, Nov 6, 2012.

  1. #1
    I have a few sites on shared hosting where each gets 1 GB of transfer per month. They're super-low-traffic sites, so I thought that wouldn't be an issue, but take a look at this:
    [screenshot: bandwidth usage broken down by user agent]

    That's just for the first 5 days of this month, and last month I already put up a robots.txt like this:

    # Begin block Bad-Robots from robots.txt
    User-agent: robot
    Disallow:/
    User-agent: bot
    Disallow:/
    User-agent: spider
    Disallow:/
    User-agent: crawl
    Disallow:/
    User-agent: asterias
    Disallow:/
    User-agent: BackDoorBot/1.0
    Disallow:/
    User-agent: Black Hole
    Disallow:/
    User-agent: BlowFish/1.0
    Disallow:/
    User-agent: BotALot
    Disallow:/
    User-agent: BuiltBotTough
    Disallow:/
    User-agent: Bullseye/1.0
    Disallow:/
    User-agent: BunnySlippers
    Disallow:/
    User-agent: Cegbfeieh
    Disallow:/
    User-agent: CheeseBot
    Disallow:/
    User-agent: CherryPicker
    Disallow:/
    User-agent: CherryPickerElite/1.0
    Disallow:/
    User-agent: CherryPickerSE/1.0
    Disallow:/
    User-agent: CopyRightCheck
    Disallow:/
    User-agent: cosmos
    Disallow:/
    User-agent: Crescent
    Disallow:/
    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    Disallow:/
    User-agent: DittoSpyder
    Disallow:/
    User-agent: EmailCollector
    Disallow:/
    User-agent: EmailSiphon
    Disallow:/
    User-agent: EmailWolf
    Disallow:/
    User-agent: EroCrawler
    Disallow:/
    User-agent: ExtractorPro
    Disallow:/
    User-agent: Foobot
    Disallow:/
    User-agent: Harvest/1.5
    Disallow:/
    User-agent: hloader
    Disallow:/
    User-agent: httplib
    Disallow:/
    User-agent: humanlinks
    Disallow:/
    User-agent: InfoNaviRobot
    Disallow:/
    User-agent: JennyBot
    Disallow:/
    User-agent: Kenjin Spider
    Disallow:/
    User-agent: Keyword Density/0.9
    Disallow:/
    User-agent: LexiBot
    Disallow:/
    User-agent: libWeb/clsHTTP
    Disallow:/
    User-agent: LinkextractorPro
    Disallow:/
    User-agent: LinkScan/8.1a Unix
    Disallow:/
    User-agent: LinkWalker
    Disallow:/
    User-agent: LNSpiderguy
    Disallow:/
    User-agent: lwp-trivial
    Disallow:/
    User-agent: lwp-trivial/1.34
    Disallow:/
    User-agent: Mata Hari
    Disallow:/
    User-agent: Microsoft URL Control - 5.01.4511
    Disallow:/
    User-agent: Microsoft URL Control - 6.00.8169
    Disallow:/
    User-agent: MIIxpc
    Disallow:/
    User-agent: MIIxpc/4.2
    Disallow:/
    User-agent: Mister PiX
    Disallow:/
    User-agent: moget
    Disallow:/
    User-agent: moget/2.1
    Disallow:/
    User-agent: NetAnts
    Disallow:/
    User-agent: NICErsPRO
    Disallow:/
    User-agent: Offline Explorer
    Disallow:/
    User-agent: Openfind
    Disallow:/
    User-agent: Openfind data gathere
    Disallow:/
    User-agent: ProPowerBot/2.14
    Disallow:/
    User-agent: ProWebWalker
    Disallow:/
    User-agent: QueryN Metasearch
    Disallow:/
    User-agent: RepoMonkey
    Disallow:/
    User-agent: RepoMonkey Bait & Tackle/v1.01
    Disallow:/
    User-agent: RMA
    Disallow:/
    User-agent: SiteSnagger
    Disallow:/
    User-agent: SpankBot
    Disallow:/
    User-agent: spanner
    Disallow:/
    User-agent: suzuran
    Disallow:/
    User-agent: Szukacz/1.4
    Disallow:/
    User-agent: Teleport
    Disallow:/
    User-agent: TeleportPro
    Disallow:/
    User-agent: Telesoft
    Disallow:/
    User-agent: The Intraformant
    Disallow:/
    User-agent: TheNomad
    Disallow:/
    User-agent: TightTwatBot
    Disallow:/
    User-agent: Titan
    Disallow:/
    User-agent: toCrawl/UrlDispatcher
    Disallow:/
    User-agent: True_Robot
    Disallow:/
    User-agent: True_Robot/1.0
    Disallow:/
    User-agent: turingos
    Disallow:/
    User-agent: URLy Warning
    Disallow:/
    User-agent: VCI
    Disallow:/
    User-agent: VCI WebViewer VCI WebViewer Win32
    Disallow:/
    User-agent: Web Image Collector
    Disallow:/
    User-agent: WebAuto
    Disallow:/
    User-agent: WebBandit
    Disallow:/
    User-agent: WebBandit/3.50
    Disallow:/
    User-agent: WebCopier
    Disallow:/
    User-agent: WebEnhancer
    Disallow:/
    User-agent: WebmasterWorldForumBot
    Disallow:/
    User-agent: WebSauger
    Disallow:/
    User-agent: Website Quester
    Disallow:/
    User-agent: Webster Pro
    Disallow:/
    User-agent: WebStripper
    Disallow:/
    User-agent: WebZip
    Disallow:/
    User-agent: WebZip/4.0
    Disallow:/
    User-agent: Wget
    Disallow:/
    User-agent: Wget/1.5.3
    Disallow:/
    User-agent: Wget/1.6
    Disallow:/
    User-agent: WWW-Collector-E
    Disallow:/
    User-agent: Xenu's
    Disallow:/
    User-agent: Xenu's Link Sleuth 1.1c
    Disallow:/
    User-agent: Zeus
    Disallow:/
    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    Disallow:/
    # Begin Exclusion From Directories from robots.txt
    Disallow: /cgi-bin/
    Code (markup):
    It basically blocks all known spider bots except Google, Bing and Yahoo,
    and as you can see,
    "robot", "bot", "spider" and "crawl" are not respecting it at all.

    So, a few questions here:

    1. Does anybody know who runs those bots and why they don't respect robots.txt?
    2. What's up with Googlebot consuming 660 MB in 5 days? Aren't they supposed to NOT be that aggressive? There was a video where Matt Cutts explains how they are extra careful not to crawl sites too fast or too aggressively, since that can cause problems for smaller hosts.
    3. If I add these lines:
    User-agent: *bot
    Disallow: /

    since this is the ID of one of the bots, will that also disallow "Googlebot", or is * in robots.txt treated literally rather than as a catch-all symbol?


    An answer to any of the 3 questions would be appreciated :)
     
    Fking, Nov 6, 2012 IP
  2. #2
    Because, as you said, they are evil. They want to make money (or do something else), so they just ignore it.

    From a quick search: check your website for broken links, 404s, unnecessarily huge images, etc.
    If your site is fine, then Googlebot is simply hungry, and there's no way around that other than blocking it :p
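    If you want to see where the transfer is actually going before blocking anything, you can sum the bytes per user agent straight from the server's access log. A minimal sketch, assuming an Apache combined-format log (the log path is a placeholder; in that format field $10 is the response size in bytes):

    # Sum the bytes served to Googlebot according to the access log
    grep 'Googlebot' /path/to/access_log | awk '{ bytes += $10 } END { printf "%.1f MB\n", bytes / 1048576 }'
    Code (markup):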


    Can't answer that for certain, but why not just disallow ALL other bots and only allow the ones you like?

    
    # Allowed robots (use each crawler's actual user-agent token)
    User-agent: Googlebot
    Allow: /

    # Yahoo's crawler identifies itself as Slurp
    User-agent: Slurp
    Allow: /

    User-agent: Bingbot
    Allow: /

    # All other robots
    User-agent: *
    Disallow: /
    
    Code (markup):

    If you really want to block the bad bots and can find out their IP addresses, you can block them via .htaccess.
    Here is a tutorial:
    
    http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml
    
    Code (markup):
    Bots can't ignore .htaccess rules the way they ignore robots.txt, because the server enforces them before anything is served.
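    If it helps, here is a rough .htaccess sketch in the Apache 2.2 style that blocks by user agent and by IP. The user-agent strings are taken from the lists above; the IP addresses are placeholder documentation addresses, not real offenders, so swap in whatever you find in your own logs:

    # Flag requests whose User-Agent contains a known bad-bot string
    SetEnvIfNoCase User-Agent "WebStripper" bad_bot
    SetEnvIfNoCase User-Agent "WebZip" bad_bot
    SetEnvIfNoCase User-Agent "EmailCollector" bad_bot

    # Deny flagged user agents and any IPs identified in the logs
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    Deny from 192.0.2.10
    Deny from 198.51.100.0/24
    Code (markup):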
     
    GMF, Nov 6, 2012 IP
  3. #3
    All your suggestions are excellent, thank you! :)
     
    Fking, Nov 6, 2012 IP
  4. #4
    Hey, if this is on a dedicated server you can do this in iptables, which is even better as it will drop all their connection attempts. The way to do it is a STRING match, like this:
    iptables -A INPUT -m string --algo bm --string "BADBOTUSERAGENT" -j DROP

    That will work IF they are sending their real user agent, but you can also block libwww-perl and just about anything else that can be used to make automated requests.
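    For example, a slightly fuller sketch along those lines, restricted to web traffic on port 80 (the user-agent strings are just examples of agents you might drop):

    # Drop packets whose payload contains a blacklisted user-agent string
    iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "libwww-perl" -j DROP
    iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "WebStripper" -j DROP
    iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "EmailCollector" -j DROP
    Code (markup):
    Keep in mind each string rule inspects packet payloads, so it can add CPU overhead; keep the list short.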

    However, bandwidth is cheaper these days, and the average webmaster usually won't mind bots crawling the site. I guess it depends on what kind they are and what they are doing. I remember seeing posts like these back in 2004-05, but not much since then, as bandwidth has gotten a lot cheaper.
     
    blockdos, Nov 6, 2012 IP