I am hoping someone can help me find the list of bad robots to add to my robots.txt. I don't remember if it was at this forum or at SEOchat, but someone posted a list of bad robots to disallow, some of which are e-mail harvesters, etc... I have searched but can't find it. Thanks
well, think about it for a second, do you really think the "bad" bots are going to go to a lot of trouble to identify themselves?
This thread provides an interesting example of how to use and not use the hyphen to avoid confusion. When I saw the thread title, I took it to be a report of a defective list of robots; as it turns out, the true topic is not a "bad robot list" but a "bad-robot list". (From a sometime poster to alt.english.usage)
What is the point? The really bad robots will not respect your robots.txt and may not even bother reading it. Compliance with robots.txt is voluntary.
Trying to think it through, it seems to me that you could also block robots by blocking their User-Agents. Of course, that can also be worked around! However, just because it can be worked around doesn't mean we shouldn't force them to work around it. I'm all for requiring them to do as much extra work as possible before they can annoy me. Or, to be even sneakier, you could use mod_rewrite. That would make it more difficult for them to tell that their User-Agent had been blocked. You could send them off on a wild goose chase with a RewriteCond %{HTTP_USER_AGENT} statement.
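For what it's worth, a minimal sketch of that mod_rewrite trick might look like this in .htaccess, assuming Apache with mod_rewrite enabled; the agent names and the redirect target here are just placeholders, not a recommendation of specific strings:

```apache
# Hypothetical sketch: match a few bad-bot User-Agents (case-insensitive)
# and send them somewhere useless instead of serving the real page.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (EmailSiphon|EmailCollector|WebStripper) [NC]
RewriteRule .* http://example.com/nowhere.html [R=302,L]
```

Anything not matching the pattern falls through and gets the page as normal.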
That has a nice ring to it... like adding one spammer company's e-mail address to another spammer company's list. Heheheheha (did I say I do this??)
Here is the robots.txt that I use: New Computer.ca Robots.txt. I have compiled it from a few different sources over time. If you are having a problem with one or two bots in particular, block their IP.
... yes, that's an absurdly excessive robots.txt. You do realize Mozilla is a CLIENT, not a bot, right? You're blocking regular users, even, just based on their browser. Well, I guess it won't affect most people - but if you went to more extreme measures to ban those same robots (i.e. handling pages based on what the client identifies itself as), you'd be causing a whole lot of people a whole lot of trouble. Honestly, that robots.txt is really excessive.
Hmmmm, pretty sure that my robots.txt is NOT blocking Mozilla browsers. Actually, I would be inclined to say that I am 100% sure, as I use Mozilla myself. As for 'too many', tell me which ones you feel should not be blocked...
Of course it doesn't block Mozilla, as the browser doesn't read the robots.txt file. But you are blocking MSN, etc....
Please tell me if any line of the list should be removed. I don't know most of these bots - many are self-explanatory - but if one should not be blocked, please let me know. http://www.northwestgifts.com/robots.txt
Here's a list of IPs to block using .htaccess: http://bots.pcpropertymanager.com/modules.php?name=BS_IPCheats And take a look at the bots I've tagged as mail harvesters at http://bots.pcpropertymanager.com/bsp-5.html
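For reference, blocking by IP in .htaccess looks roughly like this on Apache; the addresses below are documentation placeholders (RFC 5737 ranges), so substitute the ones from a list like the above:

```apache
# Deny requests from specific addresses or ranges; everyone else gets through.
Order Allow,Deny
Allow from all
Deny from 192.0.2.14
Deny from 198.51.100.0/24
```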
A bot that steals e-mails for spam purposes, a bot that doesn't follow the robots rules (like fetching tons of pages in a second), or a bot that is buggy. I once had one spidering my website that couldn't parse URLs like /xpto, so it kept requesting /xpto, /xpto/xpto, /xpto/xpto/xpto... an endless loop. I had to ban it!
I'd add to the list bots that are just trying to get listed in your top 10 referrers list, or to build PR by getting linked from your stats. Check out: http://sarahk.pcpropertymanager.com/blogspam.php I had my stats open because I'd done some work on the open source package and it was available as a demo. I've had to close it, partly because Google was indexing it and I had clickable links for the referrer info. I could have just used robots.txt, but I had other reasons too. Sites that were spamming me included the standard adult stuff, but also a band trying to promote themselves and a Belgian cafe. Go figure. The code to do this type of spamming is really simple, so we'll be seeing more of it.
New Computer There's a lot of good advice in this post. You cannot use robots.txt to block anything unless the spider actually requests the file and respects the rules. "Bad" bots will just ignore the file. The best way to effectively block them is through the .htaccess file, or perhaps through a firewall.
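A sketch of that .htaccess approach, assuming Apache with mod_setenvif available; the substrings below are just examples taken from lists like the ones in this thread:

```apache
# Tag requests whose User-Agent contains a known-bad substring (case-insensitive),
# then deny anything carrying that tag. Whether the bot reads robots.txt is irrelevant here.
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "WebBandit" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

A forged User-Agent still slips through, of course, but as noted above, at least it forces the bot writers to do the extra work.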
You may be confusing me with someone else. I never said anything about using ONLY robots.txt to block unwanted bots. .htaccess is of course the best way to ensure that the 'unwanted' stay out. PS: They now change IPs on a regular basis - at least, that is what my logs tell me about 2 in particular.
Better late than never (I kept the format of the file so you can copy & paste it):

User-agent: larbin
Disallow: /
User-agent: b2w/0.1
Disallow: /
User-agent: Copernic
Disallow: /
User-agent: psbot
Disallow: /
User-agent: Python-urllib
Disallow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: URL_Spider_Pro
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: ExtractorPro
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: LNSpiderguy
Disallow: /
User-agent: Alexibot
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: RMA
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: asterias
Disallow: /
User-agent: httplib
Disallow: /
User-agent: turingos
Disallow: /
User-agent: spanner
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: Microsoft URL Control - 5.01.4511
Disallow: /
User-agent: Not Your Business!
Disallow: /
User-agent: Hidden-Referrer
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: Foobot
Disallow: /
User-agent: WebmasterWorldForumBot
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: Microsoft URL Control - 6.00.8169
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: Wget
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: moget
Disallow: /
User-agent: hloader
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: TightTwatBot
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: VCI
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: Microsoft URL Control
Disallow: /
User-agent: Openbot
Disallow: /
User-agent: URL Control
Disallow: /
User-agent: Zeus Link Scout
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Iron33/1.0.2
Disallow: /
User-agent: Bookmark search tool
Disallow: /
User-agent: GetRight/4.2
Disallow: /
User-agent: FairAd Client
Disallow: /
User-agent: Gaisbot
Disallow: /
User-agent: Aqua_Products
Disallow: /
User-agent: Radiation Retriever 1.1
Disallow: /
User-agent: WebmasterWorld Extractor
Disallow: /
User-agent: Flaming AttackBot
Disallow: /
User-agent: Oracle Ultra Search
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: PerMan
Disallow: /
User-agent: searchpreview
Disallow: /