I have a few sites on shared hosting where each gets 1 GB of transfer per month. Those are super low-traffic sites and I thought that wouldn't be an issue, but take a look at this:

That's just for the first 5 days of this month, and last month I put up a robots.txt like this:

# Begin block Bad-Robots from robots.txt
User-agent: robot
Disallow: /
User-agent: bot
Disallow: /
User-agent: spider
Disallow: /
User-agent: crawl
Disallow: /
User-agent: spider
Disallow: /
User-agent: asterias
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: Black Hole
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: Cegbfeieh
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: ExtractorPro
Disallow: /
User-agent: Foobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: hloader
Disallow: /
User-agent: httplib
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: LNSpiderguy
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: Microsoft URL Control - 5.01.4511
Disallow: /
User-agent: Microsoft URL Control - 6.00.8169
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: moget
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RMA
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: spanner
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: TightTwatBot
Disallow: /
User-agent: Titan
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: turingos
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: VCI
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: WebmasterWorldForumBot
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: Wget
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
# Begin Exclusion From Directories from robots.txt
Disallow: /cgi-bin/

It's basically blocking all known spider bots except Google, Bing and Yahoo, and as you can see, robot, bot, spider and crawl are not respecting that at all. So, a few questions:

1. Does anybody know who runs those bots and why they don't respect robots.txt?

2. What's up with Googlebot consuming 660 MB in 5 days? Aren't they supposed to NOT be aggressive like that? There was a video where Matt Cutts explains how they are extra careful not to crawl sites too fast or too aggressively, since that might cause problems for smaller hosts.

3. If I add the lines:

User-agent: *bot
Disallow: /

since this is the ID of one of the bots, will that also disallow "Googlebot", or is * in robots.txt a literal * rather than a catch-all symbol?

Answering any of the 3 questions will be appreciated.
Because they, as you said, are evil. They want to make money (or something else), so they just ignore it.

From a quick search: check your website for broken links, 404s, unnecessarily huge images, etc. If your site is fine, then Googlebot is just hungry and there is no way around it other than blocking it.

Can't answer that for certain, but why not just disallow ALL other bots and only allow the ones of your liking? Something like this (using the crawlers' actual user-agent tokens: Googlebot, Slurp for Yahoo, and Bingbot):

# Allowed robots
User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: Bingbot
Allow: /

# All other robots
User-agent: *
Disallow: /

If you want to really block the bad bots and you can find out their IP addresses, you can block them via .htaccess. Here is a tutorial: http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml

Bots are not able to ignore .htaccess files.
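A minimal sketch of what that kind of IP-based blocking in .htaccess can look like, assuming Apache 2.2-style mod_authz_host directives; the addresses below are placeholders, not the real bots' IPs:

# Deny requests from specific bot IP addresses (placeholder addresses)
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24

On Apache 2.4 the same effect is achieved with "Require not ip ..." inside a RequireAll block, so it's worth checking which version the host runs before copying rules from a tutorial.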
Hey, if this is on a dedicated server you can do this in iptables, which is even better as it will drop all their connection attempts. The way to do it is a STRING match, like this:

iptables -A INPUT -m string --algo bm --string "BADBOTUSERAGENT" -j DROP

That will work IF they are sending a correct user agent, but you can also block libwww-perl and just about anything else that can be used to make automated requests. However, bandwidth is cheaper these days, and the average webmaster usually won't mind bots crawling the site; I guess it depends what kind they are and what they are doing. I remember seeing posts like these back in 2004-05, but not much since then, as bandwidth has gotten a lot cheaper.
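A hedged variant of the same idea, assuming the client really sends an identifying User-Agent header; the "libwww-perl" string and port 80 are just example choices, not something from the original post:

# Drop inbound HTTP packets whose payload contains the string "libwww-perl"
iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "libwww-perl" -j DROP

Keep in mind the string match only inspects packet contents, so a bot that spoofs its user agent slips right past it; it mainly catches the lazy ones that identify themselves honestly.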