bad robot list for robots.txt

Discussion in 'Apache' started by debunked, Jun 7, 2004.

  1. #1
    I am hoping someone can help me find the list of bad robots to add to robots.txt. I don't remember if it was at this forum or at SEOchat, but someone posted a list of bad robots to disallow, some of which are e-mail harvesters, etc.

    I have searched, but cannot find it.
    Thanks
     
    debunked, Jun 7, 2004 IP
    Will.Spencer likes this.
  2. disgust

    disgust Guest

    #2
    well, think about it for a second: do you really think the "bad" bots are going to go to a lot of trouble to identify themselves?
     
    disgust, Jun 7, 2004 IP
  3. digitalpoint

    digitalpoint Overlord of no one Staff

    #3
    That's what I was going to say. :)
     
    digitalpoint, Jun 7, 2004 IP
  4. Owlcroft

    Owlcroft Peon

    #4
    This thread provides an interesting example of how to use and not use the hyphen to avoid confusion.

    When I saw the thread title, I took it to be a report of a defective list of robots; as it turns out, the true topic is not a "bad robot list" but a "bad-robot list".

    (From a sometime poster to alt.english.usage)
     
    Owlcroft, Jun 7, 2004 IP
  5. mushroom

    mushroom Peon

    #5
    What is the point? The really bad robots will not respect your robots.txt and may not even bother reading it.

    Compliance with robots.txt is voluntary.
     
    mushroom, Jun 7, 2004 IP
  6. Will.Spencer

    Will.Spencer NetBuilder

    #6
    Trying to think it through, it seems to me that you could also block robots by blocking their User Agents.

    Of course, that can also be worked around!

    However, just because it can be worked around doesn't mean we shouldn't force them to work around it. I'm all for requiring them to do as much extra work as possible before they can annoy me.

    Or, to be even sneakier, you could use mod_rewrite. This would make it more difficult for them to tell that their User Agent had been blocked. You could send their User Agent off on a wild goose chase with a RewriteCond %{HTTP_USER_AGENT} statement.
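
    For illustration, a minimal .htaccess sketch of that idea, assuming mod_rewrite is enabled; the user agents are a few entries from the lists later in this thread, and /goosechase.html is a made-up decoy path, not a real page:

    RewriteEngine On
    # don't rewrite the decoy page itself, to avoid a loop
    RewriteCond %{REQUEST_URI} !^/goosechase\.html$
    # match a few user agents from the lists in this thread (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} (EmailSiphon|EmailCollector|WebStripper) [NC]
    # send the agent off on the wild goose chase (use "- [F,L]" instead to simply return 403)
    RewriteRule .* /goosechase.html [L]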
     
    Will.Spencer, Jun 7, 2004 IP
  7. debunked

    debunked Prominent Member

    #7
    That has a nice ring to it... like adding a spam company's e-mail address to another spam company's list. Heheheheha (did I say I do this??)
     
    debunked, Jun 8, 2004 IP
    Will.Spencer and bogart like this.
  8. NewComputer

    NewComputer Well-Known Member

    #8
    Here is the robots.txt that I use: New Computer.ca Robots.txt

    I have compiled it from a few different sources over time. If you are having a problem with one or two bots in particular, block their IP.
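
    For illustration, a minimal .htaccess sketch of blocking by IP with Apache's standard allow/deny directives; the addresses below are documentation placeholders, not real offenders:

    # refuse a misbehaving crawler by address or range
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 203.0.113.0/24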
     
    NewComputer, Jun 12, 2004 IP
  9. nlopes

    nlopes Guest

    #9
    Your robots.txt file is excessive! You have blocked robots that you shouldn't!
     
    nlopes, Jun 14, 2004 IP
  10. disgust

    disgust Guest

    #10
    ... yes, that's an absurdly excessive robots.txt.

    you do realize mozilla is a CLIENT, not a bot, right? you're blocking regular users, even. just based on their browser.

    well, I guess it won't affect most people, but if you went to more extreme measures to ban those same robots (i.e. handling pages based on what they identify as), you'd be causing a whole lot of people a whole lot of trouble.

    honestly that robots.txt is really excessive though. :(
     
    disgust, Jun 14, 2004 IP
  11. NewComputer

    NewComputer Well-Known Member

    #11
    Hmmmm, pretty sure that my robots.txt is NOT blocking Mozilla browsers. Actually, I would be inclined to say that I am 100% sure, as I use Mozilla. As for 'too many', tell me which ones you feel should not be blocked...
     
    NewComputer, Jun 14, 2004 IP
  12. nlopes

    nlopes Guest

    #12
    Of course it doesn't block Mozilla, as Mozilla doesn't read the robots.txt file.

    But you are blocking MSN, etc.
     
    nlopes, Jun 14, 2004 IP
  13. debunked

    debunked Prominent Member

    #13
    Please tell me if any line of the list should be removed. I don't know most of these bots; many are self-explanatory, but if one should not be blocked, please let me know.

    http://www.northwestgifts.com/robots.txt
     
    debunked, Jun 14, 2004 IP
  14. sarahk

    sarahk iTamer Staff

    #14
    sarahk, Jun 14, 2004 IP
  15. THT

    THT Peon

    #15
    what defines a bot as a "bad bot"?
     
    THT, Jun 14, 2004 IP
  16. nlopes

    nlopes Guest

    #16
    A bot that steals e-mails for spam purposes, a bot that doesn't follow the robots rules (like fetching tons of pages in a second), or a bot that is buggy (I already had one spidering my website that couldn't parse URLs like /xpto, so it was requesting URLs like /xpto, /xpto/xpto, /xpto/xpto/xpto... an endless loop. I had to ban it!)
     
    nlopes, Jun 15, 2004 IP
  17. sarahk

    sarahk iTamer Staff

    #17
    I'd add to the list bots just trying to get listed on your top 10 referrers list, or to build PR by getting linked from your stats. Check out: http://sarahk.pcpropertymanager.com/blogspam.php

    I had my stats open because I'd done some work on the open-source package and it was available as a demo. I've had to close it, partly because Google was indexing it and I had clickable links for the referrer info. I could have just used robots.txt, but I had other reasons too. Sites that were spamming me included the standard adult stuff, but also a band trying to promote themselves and a Belgian cafe. Go figure.

    The code to do this type of spamming is really simple, so we'll be seeing more of it.
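
    For illustration, a minimal .htaccess sketch of refusing that kind of traffic by Referer, assuming mod_setenvif is available; "spam-example.com" is a placeholder domain, not one of the actual offenders mentioned above:

    # flag requests whose Referer matches a known spam domain, then refuse them
    SetEnvIfNoCase Referer "spam-example\.com" spam_referrer
    Order Allow,Deny
    Allow from all
    Deny from env=spam_referrer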
     
    sarahk, Jun 15, 2004 IP
  18. TechEvangelist

    TechEvangelist Guest

    #18
    New Computer

    There's a lot of good advice in this post. You cannot effectively use robots.txt to block anything unless the spider algorithm requests the file and respects the rules. "Bad" bots will just ignore the file.

    The best way to effectively block them is through the .htaccess file or perhaps through a firewall.
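
    For illustration, a minimal .htaccess sketch of that approach, assuming mod_setenvif is available; the user agents are a few entries from the lists posted in this thread:

    # flag requests from known bad user agents, then refuse them
    BrowserMatchNoCase "EmailSiphon"  bad_bot
    BrowserMatchNoCase "CherryPicker" bad_bot
    BrowserMatchNoCase "WebStripper"  bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot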
     
    TechEvangelist, Jun 15, 2004 IP
  19. NewComputer

    NewComputer Well-Known Member

    #19
    You may be confusing me with someone else. I never said anything about using ONLY the robots.txt to block unwanted bots. .htaccess is of course the best way to ensure that the 'unwanted' stay out.

    PS: They now change IPs on a regular basis; at least, that is what my logs tell me about two in particular.
     
    NewComputer, Jun 15, 2004 IP
  20. Christine8

    Christine8 Peon

    #20
    Better late than never (I kept the format for the file so you can copy & paste them):

    User-agent: larbin
    Disallow: /

    User-agent: b2w/0.1
    Disallow: /

    User-agent: Copernic
    Disallow: /

    User-agent: psbot
    Disallow: /

    User-agent: Python-urllib
    Disallow: /

    User-agent: Googlebot-Image
    Disallow: /

    User-agent: URL_Spider_Pro
    Disallow: /

    User-agent: CherryPicker
    Disallow: /

    User-agent: EmailCollector
    Disallow: /

    User-agent: EmailSiphon
    Disallow: /

    User-agent: WebBandit
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: ExtractorPro
    Disallow: /

    User-agent: CopyRightCheck
    Disallow: /

    User-agent: Crescent
    Disallow: /

    User-agent: SiteSnagger
    Disallow: /

    User-agent: ProWebWalker
    Disallow: /

    User-agent: CheeseBot
    Disallow: /

    User-agent: LNSpiderguy
    Disallow: /

    User-agent: Alexibot
    Disallow: /

    User-agent: Teleport
    Disallow: /

    User-agent: TeleportPro
    Disallow: /

    User-agent: MIIxpc
    Disallow: /

    User-agent: Telesoft
    Disallow: /

    User-agent: Website Quester
    Disallow: /

    User-agent: WebZip
    Disallow: /

    User-agent: moget/2.1
    Disallow: /

    User-agent: WebZip/4.0
    Disallow: /

    User-agent: WebStripper
    Disallow: /

    User-agent: WebSauger
    Disallow: /

    User-agent: WebCopier
    Disallow: /

    User-agent: NetAnts
    Disallow: /

    User-agent: Mister PiX
    Disallow: /

    User-agent: WebAuto
    Disallow: /

    User-agent: TheNomad
    Disallow: /

    User-agent: WWW-Collector-E
    Disallow: /

    User-agent: RMA
    Disallow: /

    User-agent: libWeb/clsHTTP
    Disallow: /

    User-agent: asterias
    Disallow: /

    User-agent: httplib
    Disallow: /

    User-agent: turingos
    Disallow: /

    User-agent: spanner
    Disallow: /

    User-agent: InfoNaviRobot
    Disallow: /

    User-agent: Harvest/1.5
    Disallow: /

    User-agent: Bullseye/1.0
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
    Disallow: /

    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    Disallow: /

    User-agent: CherryPickerSE/1.0
    Disallow: /

    User-agent: CherryPickerElite/1.0
    Disallow: /

    User-agent: WebBandit/3.50
    Disallow: /

    User-agent: NICErsPRO
    Disallow: /

    User-agent: Microsoft URL Control - 5.01.4511
    Disallow: /

    User-agent: Not Your Business!
    Disallow: /

    User-agent: Hidden-Referrer
    Disallow: /

    User-agent: DittoSpyder
    Disallow: /

    User-agent: Foobot
    Disallow: /

    User-agent: WebmasterWorldForumBot
    Disallow: /

    User-agent: SpankBot
    Disallow: /

    User-agent: BotALot
    Disallow: /

    User-agent: lwp-trivial/1.34
    Disallow: /

    User-agent: lwp-trivial
    Disallow: /

    User-agent: BunnySlippers
    Disallow: /

    User-agent: Microsoft URL Control - 6.00.8169
    Disallow: /

    User-agent: URLy Warning
    Disallow: /

    User-agent: Wget/1.6
    Disallow: /

    User-agent: Wget/1.5.3
    Disallow: /

    User-agent: Wget
    Disallow: /

    User-agent: LinkWalker
    Disallow: /

    User-agent: cosmos
    Disallow: /

    User-agent: moget
    Disallow: /

    User-agent: hloader
    Disallow: /

    User-agent: humanlinks
    Disallow: /

    User-agent: LinkextractorPro
    Disallow: /

    User-agent: Offline Explorer
    Disallow: /

    User-agent: Mata Hari
    Disallow: /

    User-agent: LexiBot
    Disallow: /

    User-agent: Web Image Collector
    Disallow: /

    User-agent: The Intraformant
    Disallow: /

    User-agent: True_Robot/1.0
    Disallow: /

    User-agent: True_Robot
    Disallow: /

    User-agent: BlowFish/1.0
    Disallow: /

    User-agent: JennyBot
    Disallow: /

    User-agent: MIIxpc/4.2
    Disallow: /

    User-agent: BuiltBotTough
    Disallow: /

    User-agent: ProPowerBot/2.14
    Disallow: /

    User-agent: BackDoorBot/1.0
    Disallow: /

    User-agent: toCrawl/UrlDispatcher
    Disallow: /

    User-agent: WebEnhancer
    Disallow: /

    User-agent: suzuran
    Disallow: /

    User-agent: TightTwatBot
    Disallow: /

    User-agent: VCI WebViewer VCI WebViewer Win32
    Disallow: /

    User-agent: VCI
    Disallow: /

    User-agent: Szukacz/1.4
    Disallow: /

    User-agent: QueryN Metasearch
    Disallow: /

    User-agent: Openfind data gathere
    Disallow: /

    User-agent: Openfind
    Disallow: /

    User-agent: Xenu's Link Sleuth 1.1c
    Disallow: /

    User-agent: Xenu's
    Disallow: /

    User-agent: Zeus
    Disallow: /

    User-agent: RepoMonkey Bait & Tackle/v1.01
    Disallow: /

    User-agent: RepoMonkey
    Disallow: /

    User-agent: Microsoft URL Control
    Disallow: /

    User-agent: Openbot
    Disallow: /

    User-agent: URL Control
    Disallow: /

    User-agent: Zeus Link Scout
    Disallow: /

    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    Disallow: /

    User-agent: Webster Pro
    Disallow: /

    User-agent: EroCrawler
    Disallow: /

    User-agent: LinkScan/8.1a Unix
    Disallow: /

    User-agent: Keyword Density/0.9
    Disallow: /

    User-agent: Kenjin Spider
    Disallow: /

    User-agent: Iron33/1.0.2
    Disallow: /

    User-agent: Bookmark search tool
    Disallow: /

    User-agent: GetRight/4.2
    Disallow: /

    User-agent: FairAd Client
    Disallow: /

    User-agent: Gaisbot
    Disallow: /

    User-agent: Aqua_Products
    Disallow: /

    User-agent: Radiation Retriever 1.1
    Disallow: /

    User-agent: WebmasterWorld Extractor
    Disallow: /

    User-agent: Flaming AttackBot
    Disallow: /

    User-agent: Oracle Ultra Search
    Disallow: /

    User-agent: MSIECrawler
    Disallow: /

    User-agent: PerMan
    Disallow: /

    User-agent: searchpreview
    Disallow: /
     
    Christine8, Jun 9, 2006 IP