View Full Version : bad robot list for robots.txt


debunked
Jun 7th 2004, 2:01 pm
I am hoping someone can help me find the list of bad robots to add to robots.txt. I don't remember if it was at this forum or at SEOchat, but someone posted a list of bad robots to disallow, some of which are e-mail harvesters, etc...

I have searched, but cannot find it.
Thanks

disgust
Jun 7th 2004, 3:35 pm
well, think about it for a second, do you really think the "bad" bots are going to go to a lot of trouble to identify themselves?

digitalpoint
Jun 7th 2004, 6:24 pm
That's what I was going to say. :)

Owlcroft
Jun 7th 2004, 6:52 pm
This thread provides an interesting example of how to use and not use the hyphen to avoid confusion.

When I saw the thread title, I took it to be a report of a defective list of robots; as it turns out, the true topic is not a "bad robot list" but a "bad-robot list".

(From a sometime poster to alt.english.usage)

mushroom
Jun 7th 2004, 7:57 pm
What is the point? The really bad robots will not respect your robots.txt and may not even bother reading it.

Compliance with robots.txt is voluntary.
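
To make the "voluntary" part concrete, here is a minimal sketch of what a well-behaved crawler does on its own side, using Python's standard urllib.robotparser. The rules and the bot name "SomeOtherBot" are made up for illustration. The point is that the check runs in the crawler's code, so a bot that skips it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, in the same format as the lists discussed in this thread.
RULES = """\
User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow: /private/
"""

def make_parser(rules_text):
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser

parser = make_parser(RULES)

# A compliant crawler asks before each fetch; a rude one simply never asks.
polite_harvester = parser.can_fetch("EmailSiphon", "/contact.html")
ordinary_bot = parser.can_fetch("SomeOtherBot", "/contact.html")
ordinary_private = parser.can_fetch("SomeOtherBot", "/private/a.html")

print(polite_harvester, ordinary_bot, ordinary_private)
```

Nothing on the server side enforces these answers, which is why the rest of the thread moves on to .htaccess and IP blocks.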

Will.Spencer
Jun 7th 2004, 10:12 pm
Trying to think it through, it seems to me that you could also block robots by blocking their User Agents (http://www.internet-search-engines-faq.com/prevent-web-site-download.shtml).

Of course, that can also be worked around!

However, just because it can be worked around doesn't mean we shouldn't force them to work around it. I'm all for requiring them to do as much extra work as possible before they can annoy me.

Or, to be even sneakier, you could use mod_rewrite. This would make it more difficult for them to tell that their User Agent had been blocked. You could send their User Agent off on a wild goose chase with a RewriteCond %{HTTP_USER_AGENT} statement.
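
As a rough illustration of the User Agent idea (not the mod_rewrite rule itself), here is a small Python sketch of the matching logic a server-side filter might apply. The blocklist entries and the function names are assumptions chosen for the example; a real list would come from your own logs.

```python
# Hypothetical blocklist: a few names borrowed from typical bad-bot lists.
BLOCKED_AGENT_SUBSTRINGS = ("emailsiphon", "emailcollector", "webzip", "wget")

def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a blocked name."""
    ua = (user_agent or "").lower()
    return any(name in ua for name in BLOCKED_AGENT_SUBSTRINGS)

def respond(user_agent):
    """Return an HTTP status code: 403 for blocked agents, 200 otherwise."""
    return 403 if is_blocked_agent(user_agent) else 200

print(respond("Wget/1.6"))
print(respond("Mozilla/5.0 (Windows; U) Gecko/20040113"))
```

Keep the substrings specific. Adding something as broad as "mozilla" to a list like this would turn away nearly every ordinary browser, and a bot can still get past the filter by sending a different User Agent string.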

debunked
Jun 8th 2004, 11:21 am
That has a nice ring to it... Like adding one spammer company's e-mail to another spammer company's list. Heheheheha (did I say I do this??)

NewComputer
Jun 12th 2004, 9:09 am
Here is the robots.txt that I use: New Computer.ca Robots.txt (http://www.newcomputer.ca/robots.txt)

I have compiled it from a few different sources over time. If you are having a problem with one or two in particular, block their IP.

nlopes
Jun 14th 2004, 5:11 am
Your robots.txt file is excessive! You have blocked robots that you shouldn't!

disgust
Jun 14th 2004, 5:32 am
... yes, that's an absurdly excessive robots.txt.

you do realize mozilla is a CLIENT, not a bot, right? you're blocking regular users, even. just based on their browser.

well, I guess it won't affect most people, but if you went to more extreme measures to ban those same robots (i.e. handling pages based on what the client idents as), you'd be causing a whole lot of people a whole lot of trouble.

honestly that robots.txt is really excessive though. :(

NewComputer
Jun 14th 2004, 5:57 am
hmmmm, pretty sure that my robots text is NOT blocking Mozilla browsers, actually, I would be inclined to say that I am 100% sure, as I use Mozilla. As for 'too many', tell me which ones you feel should not be blocked...

nlopes
Jun 14th 2004, 6:17 am
Of course it doesn't block Mozilla, as Mozilla doesn't read the robots.txt file.

But you are blocking msn, etc....

debunked
Jun 14th 2004, 9:33 am
Please tell me if any line of the list should be removed. I don't know most of these bots. Many are self-explanatory, but if one should not be blocked, please let me know.

http://www.northwestgifts.com/robots.txt

sarahk
Jun 14th 2004, 12:44 pm
Here's a list of IPs to block using .htaccess:
http://bots.pcpropertymanager.com/modules.php?name=BS_IPCheats

and take a look at the bots I've tagged as mail harvesters on http://bots.pcpropertymanager.com/bsp-5.html

THT
Jun 14th 2004, 1:55 pm
what defines a bot as a "bad bot"?

nlopes
Jun 15th 2004, 12:32 am
A bot that steals e-mail addresses for spam purposes, a bot that doesn't follow the robots rules (like fetching tons of pages in a second), or a bot that is buggy. (I already had one spidering my website that couldn't parse URLs like /xpto, so it was requesting URLs like /xpto, /xpto/xpto, /xpto/xpto/xpto ... in an endless loop. I had to ban it!)
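
For the buggy-spider case, one possible way to spot that kind of loop in your logs is to look for the same path segment repeated back to back. This is only a sketch; the function name and the threshold of two repeats are assumptions, and a site with legitimately repetitive URLs would need a different limit.

```python
def has_repeated_segments(path, max_repeats=2):
    """Return True if a path segment repeats consecutively more than max_repeats times."""
    segments = [s for s in path.split("/") if s]
    run = 1  # length of the current run of identical consecutive segments
    for prev, cur in zip(segments, segments[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False

for path in ("/xpto", "/xpto/xpto", "/xpto/xpto/xpto"):
    print(path, has_repeated_segments(path))
```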

sarahk
Jun 15th 2004, 12:40 am
I'd add to the list bots just trying to get listed on your top 10 referrers list, or to build PR by getting linked from your stats. Check out: http://sarahk.pcpropertymanager.com/blogspam.php

I had my stats open because I'd done some work on the open source package and it was available as a demo. I've had to close it partly because Google was indexing and I had clickable links for the referrer info. I could have just used robots.txt but I had other reasons too. Sites that were spamming me included the standard adult stuff but also a band trying to promote themselves and a Belgian cafe. Go figure.

The code to do this type of spamming is really simple, so we'll be seeing more of it.

TechEvangelist
Jun 15th 2004, 6:56 am
New Computer

There's a lot of good advice in this post. You cannot effectively use robots.txt to block anything unless the spider algorithm requests the file and respects the rules. "Bad" bots will just ignore the file.

The best way to effectively block them is through the .htaccess file or perhaps through a firewall.

NewComputer
Jun 15th 2004, 7:47 am
You may be confusing me with someone else. I never said anything about using ONLY the robots.txt to block unwanted bots. .htaccess is of course the best way to ensure that the 'unwanted' stay out.

PS: They now change IPs on a regular basis; at least that is what my logs tell me about two in particular.
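
Since single addresses go stale when a bot hops around, one option is to match against whole ranges instead. Here is a minimal sketch of that check with Python's standard ipaddress module. The ranges are placeholders from the documentation-reserved blocks, not real offenders, and a wide range can also catch innocent visitors.

```python
import ipaddress

# Hypothetical blocked ranges (documentation-reserved addresses, RFC 5737).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(n) for n in ("192.0.2.0/24", "203.0.113.7/32")
]

def is_blocked_ip(addr):
    """Return True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)

print(is_blocked_ip("192.0.2.55"))
print(is_blocked_ip("198.51.100.1"))
```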

Christine8
Jun 9th 2006, 1:27 pm
Better late than never (I kept the format for the file so you can copy & paste them):

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Alexibot
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: Not Your Business!
Disallow: /

User-agent: Hidden-Referrer
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: WebmasterWorldForumBot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: TightTwatBot
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Openfind data gathere
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: WebmasterWorld Extractor
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

User-agent: Oracle Ultra Search
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: PerMan
Disallow: /

User-agent: searchpreview
Disallow: /

zzb
Sep 3rd 2007, 10:23 am
Here is a technique that has been known to work pretty well. It takes a bit to set up, but it TRAPS bots that do not respect or check for the robots.txt file.

If this is done properly you should not need to have a ridiculously long robots.txt file.

http://danielwebb.us/software/bot-trap/
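
The general idea behind a trap like this can be sketched in a few lines. This is not the code from the link above, just an assumed outline: robots.txt disallows a path that no human would visit, and any client that requests it anyway gets banned. The path name and the BotTrap class are hypothetical.

```python
TRAP_PATH = "/bot-trap/"  # hypothetical path, listed under Disallow in robots.txt

class BotTrap:
    """Ban any client that requests the disallowed trap path."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned = set()  # in-memory only; a real setup would persist this

    def handle(self, client_ip, path):
        """Return an HTTP status code for the request."""
        if client_ip in self.banned:
            return 403
        if path.startswith(self.trap_path):
            self.banned.add(client_ip)
            return 403
        return 200

trap = BotTrap()
print(trap.handle("192.0.2.1", "/index.html"))
print(trap.handle("192.0.2.1", "/bot-trap/hidden.html"))
print(trap.handle("192.0.2.1", "/index.html"))
```

Bots that obey robots.txt never touch the trap path, so only the ones ignoring the file end up on the ban list.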


Good post folks.... there is nothing more frustrating than to see a robot on your site sucking up bandwidth and find out it was created in some college computer science course as a tutorial !!

-- ZZ