
Dangerous Crawler and User Agent! Must Read!

Discussion in 'robots.txt' started by angellica2017, Sep 19, 2006.

  1. #1
    User-agent: Black Hole
    Disallow: /

    User-agent: Titan
    Disallow: /

    User-agent: WebStripper
    Disallow: /

    User-agent: NetMechanic
    Disallow: /

    User-agent: CherryPicker
    Disallow: /

    User-agent: EmailCollector
    Disallow: /

    User-agent: EmailSiphon
    Disallow: /

    User-agent: WebBandit
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: ExtractorPro
    Disallow: /

    User-agent: CopyRightCheck
    Disallow: /

    User-agent: Crescent
    Disallow: /

    User-agent: NICErsPRO
    Disallow: /

    User-agent: Wget
    Disallow: /

    User-agent: SiteSnagger
    Disallow: /

    User-agent: ProWebWalker
    Disallow: /

    User-agent: CheeseBot
    Disallow: /

    User-agent: mozilla/4
    Disallow: /

    User-agent: mozilla/5
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 9
    Disallow: /

    User-agent: ia_archiver
    Disallow: /

    User-agent: ia_archiver/1.6
    Disallow: /

    User-agent: Alexibot
    Disallow: /

    User-agent: Teleport
    Disallow: /

    User-agent: TeleportPro
    Disallow: /

    User-agent: MIIxpc
    Disallow: /

    User-agent: Telesoft
    Disallow: /

    User-agent: Website Quester
    Disallow: /

    User-agent: WebZip
    Disallow: /

    User-agent: moget/2.1
    Disallow: /

    User-agent: WebZip/4.0
    Disallow: /

    User-agent: WebSauger
    Disallow: /

    User-agent: WebCopier
    Disallow: /

    User-agent: NetAnts
    Disallow: /

    User-agent: Mister PiX
    Disallow: /

    User-agent: WebAuto
    Disallow: /

    User-agent: TheNomad
    Disallow: /

    User-agent: WWW-Collector-E
    Disallow: /

    User-agent: RMA
    Disallow: /

    User-agent: libWeb/clsHTTP
    Disallow: /

    User-agent: asterias
    Disallow: /

    User-agent: turingos
    Disallow: /

    User-agent: spanner
    Disallow: /

    User-agent: InfoNaviRobot
    Disallow: /

    User-agent: Harvest/1.5
    Disallow: /

    User-agent: Bullseye/1.0
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
    Disallow: /

    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    Disallow: /

    User-agent: CherryPickerSE/1.0
    Disallow: /

    User-agent: CherryPickerElite/1.0
    Disallow: /

    User-agent: WebBandit/3.50
    Disallow: /

    User-agent: Microsoft URL Control - 5.01.4511
    Disallow: /

    User-agent: DittoSpyder
    Disallow: /

    User-agent: Foobot
    Disallow: /

    User-agent: WebmasterWorldForumBot
    Disallow: /

    User-agent: SpankBot
    Disallow: /

    User-agent: BotALot
    Disallow: /

    User-agent: lwp-trivial/1.34
    Disallow: /

    User-agent: lwp-trivial
    Disallow: /

    User-agent: BunnySlippers
    Disallow: /

    User-agent: Microsoft URL Control - 6.00.8169
    Disallow: /

    User-agent: URLy Warning
    Disallow: /

    User-agent: Wget/1.5.3
    Disallow: /

    User-agent: LinkWalker
    Disallow: /

    User-agent: cosmos
    Disallow: /

    User-agent: moget
    Disallow: /

    User-agent: hloader
    Disallow: /

    User-agent: humanlinks
    Disallow: /

    User-agent: LinkextractorPro
    Disallow: /

    User-agent: Offline Explorer
    Disallow: /

    User-agent: Mata Hari
    Disallow: /

    User-agent: LexiBot
    Disallow: /

    User-agent: Web Image Collector
    Disallow: /

    User-agent: The Intraformant
    Disallow: /

    User-agent: True_Robot/1.0
    Disallow: /

    User-agent: True_Robot
    Disallow: /

    User-agent: BlowFish/1.0
    Disallow: /

    User-agent: JennyBot
    Disallow: /

    User-agent: MIIxpc/4.2
    Disallow: /

    User-agent: BuiltBotTough
    Disallow: /

    User-agent: ProPowerBot/2.14
    Disallow: /

    User-agent: BackDoorBot/1.0
    Disallow: /

    User-agent: toCrawl/UrlDispatcher
    Disallow: /

    User-agent: WebEnhancer
    Disallow: /

    User-agent: TightTwatBot
    Disallow: /

    User-agent: suzuran
    Disallow: /

    User-agent: VCI WebViewer VCI WebViewer Win32
    Disallow: /

    User-agent: VCI
    Disallow: /

    User-agent: Szukacz/1.4
    Disallow: /

    User-agent: QueryN Metasearch
    Disallow: /

    User-agent: Openfind data gathere
    Disallow: /

    User-agent: Openfind
    Disallow: /

    User-agent: Xenu's Link Sleuth 1.1c
    Disallow: /

    User-agent: Xenu's
    Disallow: /

    User-agent: Zeus
    Disallow: /

    User-agent: RepoMonkey Bait & Tackle/v1.01
    Disallow: /

    User-agent: RepoMonkey
    Disallow: /

    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    Disallow: /

    User-agent: Webster Pro
    Disallow: /

    User-agent: EroCrawler
    Disallow: /

    User-agent: LinkScan/8.1a Unix
    Disallow: /

    User-agent: Kenjin Spider
    Disallow: /

    User-agent: Keyword Density/0.9
    Disallow: /

    User-agent: Cegbfeieh
    Disallow: /

    A different list:

    User-agent: larbin
    Disallow: /

    User-agent: b2w/0.1
    Disallow: /

    User-agent: Copernic
    Disallow: /

    User-agent: psbot
    Disallow: /

    User-agent: Python-urllib
    Disallow: /


    User-agent: NetMechanic
    Disallow: /

    User-agent: URL_Spider_Pro
    Disallow: /

    User-agent: CherryPicker
    Disallow: /

    User-agent: EmailCollector
    Disallow: /

    User-agent: EmailSiphon
    Disallow: /

    User-agent: WebBandit
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: ExtractorPro
    Disallow: /

    User-agent: CopyRightCheck
    Disallow: /

    User-agent: Crescent
    Disallow: /

    User-agent: SiteSnagger
    Disallow: /

    User-agent: ProWebWalker
    Disallow: /

    User-agent: CheeseBot
    Disallow: /

    User-agent: LNSpiderguy
    Disallow: /

    User-agent: Mozilla
    Disallow: /

    User-agent: mozilla
    Disallow: /

    User-agent: mozilla/3
    Disallow: /

    User-agent: mozilla/4
    Disallow: /

    User-agent: mozilla/5
    Disallow: /

    User-agent: WebAuto
    Disallow: /

    User-agent: TheNomad
    Disallow: /

    User-agent: WWW-Collector-E
    Disallow: /

    User-agent: RMA
    Disallow: /

    User-agent: libWeb/clsHTTP
    Disallow: /

    User-agent: httplib
    Disallow: /

    User-agent: turingos
    Disallow: /

    User-agent: InfoNaviRobot
    Disallow: /

    User-agent: Harvest/1.5
    Disallow: /

    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    Disallow: /

    User-agent: CherryPickerSE/1.0
    Disallow: /

    User-agent: CherryPickerElite/1.0
    Disallow: /

    User-agent: WebBandit/3.50
    Disallow: /

    User-agent: NICErsPRO
    Disallow: /

    User-agent: DittoSpyder
    Disallow: /

    User-agent: Foobot
    Disallow: /

    User-agent: BotALot
    Disallow: /

    User-agent: lwp-trivial/1.34
    Disallow: /

    User-agent: lwp-trivial
    Disallow: /

    User-agent: URLy Warning
    Disallow: /

    User-agent: hloader
    Disallow: /

    User-agent: humanlinks
    Disallow: /

    User-agent: LinkextractorPro
    Disallow: /

    User-agent: Offline Explorer
    Disallow: /

    User-agent: Mata Hari
    Disallow: /

    User-agent: LexiBot
    Disallow: /

    User-agent: Web Image Collector
    Disallow: /

    User-agent: The Intraformant
    Disallow: /

    User-agent: True_Robot/1.0
    Disallow: /

    User-agent: True_Robot
    Disallow: /

    User-agent: BlowFish/1.0
    Disallow: /

    User-agent: JennyBot
    Disallow: /

    User-agent: MIIxpc/4.2
    Disallow: /

    User-agent: BuiltBotTough
    Disallow: /

    User-agent: ProPowerBot/2.14
    Disallow: /

    User-agent: BackDoorBot/1.0
    Disallow: /

    User-agent: toCrawl/UrlDispatcher
    Disallow: /

    User-agent: WebEnhancer
    Disallow: /

    User-agent: suzuran
    Disallow: /

    User-agent: VCI WebViewer VCI WebViewer Win32
    Disallow: /

    User-agent: VCI
    Disallow: /

    User-agent: Szukacz/1.4
    Disallow: /

    User-agent: QueryN Metasearch
    Disallow: /

    User-agent: Openfind data gathere
    Disallow: /

    User-agent: Openfind
    Disallow: /

    User-agent: Xenu's Link Sleuth 1.1c
    Disallow: /

    User-agent: Xenu's
    Disallow: /

    User-agent: Zeus
    Disallow: /

    User-agent: RepoMonkey Bait & Tackle/v1.01
    Disallow: /

    User-agent: RepoMonkey
    Disallow: /

    User-agent: Openbot
    Disallow: /

    User-agent: URL Control
    Disallow: /

    User-agent: Zeus Link Scout
    Disallow: /

    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    Disallow: /

    User-agent: EroCrawler
    Disallow: /

    User-agent: LinkScan/8.1a Unix
    Disallow: /

    User-agent: Keyword Density/0.9
    Disallow: /

    User-agent: Kenjin Spider
    Disallow: /

    User-agent: Iron33/1.0.2
    Disallow: /

    User-agent: Bookmark search tool
    Disallow: /

    User-agent: GetRight/4.2
    Disallow: /

    User-agent: FairAd Client
    Disallow: /

    User-agent: Gaisbot
    Disallow: /

    User-agent: Aqua_Products
    Disallow: /

    User-agent: Radiation Retriever 1.1
    Disallow: /

    User-agent: WebmasterWorld Extractor
    Disallow: /

    User-agent: Flaming AttackBot
    Disallow: /

    User-agent: Oracle Ultra Search
    Disallow: /

    User-agent: MSIECrawler
    Disallow: /

    User-agent: PerMan
    Disallow: /

    User-agent: searchpreview
    Disallow: /

    User-agent: naver
    Disallow: /

    User-agent: dumbot
    Disallow: /

    User-agent: Hatena Antenna
    Disallow: /

    User-agent: grub-client
    Disallow: /

    User-agent: grub
    Disallow: /

    User-agent: larbin
    Disallow: /

    User-agent: b2w/0.1
    Disallow: /

    User-agent: Copernic
    Disallow: /

    User-agent: psbot
    Disallow: /

    User-agent: Python-urllib
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: ExtractorPro
    Disallow: /

    User-agent: CopyRightCheck
    Disallow: /

    User-agent: Crescent
    Disallow: /

    User-agent: SiteSnagger
    Disallow: /

    User-agent: ProWebWalker
    Disallow: /

    User-agent: CheeseBot
    Disallow: /

    User-agent: LNSpiderguy
    Disallow: /

    User-agent: Mister PiX
    Disallow: /

    User-agent: WebAuto
    Disallow: /

    User-agent: TheNomad
    Disallow: /

    User-agent: WWW-Collector-E
    Disallow: /

    User-agent: RMA
    Disallow: /

    User-agent: httplib
    Disallow: /

    User-agent: turingos
    Disallow: /

    User-agent: InfoNaviRobot
    Disallow: /

    User-agent: Harvest/1.5
    Disallow: /

    User-agent: Bullseye/1.0
    Disallow: /

    User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
    Disallow: /

    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    Disallow: /

    User-agent: CherryPickerSE/1.0
    Disallow: /

    User-agent: CherryPickerElite/1.0
    Disallow: /

    User-agent: NICErsPRO
    Disallow: /

    User-agent: URLy Warning
    Disallow: /

    User-agent: humanlinks
    Disallow: /

    User-agent: Web Image Collector
    Disallow: /

    User-agent: The Intraformant
    Disallow: /

    User-agent: True_Robot/1.0
    Disallow: /

    User-agent: True_Robot
    Disallow: /

    User-agent: BlowFish/1.0
    Disallow: /

    User-agent: JennyBot
    Disallow: /

    User-agent: MIIxpc/4.2
    Disallow: /

    User-agent: BuiltBotTough
    Disallow: /

    User-agent: ProPowerBot/2.14
    Disallow: /

    User-agent: BackDoorBot/1.0
    Disallow: /

    User-agent: toCrawl/UrlDispatcher
    Disallow: /

    User-agent: WebEnhancer
    Disallow: /

    User-agent: suzuran
    Disallow: /

    User-agent: VCI WebViewer VCI WebViewer Win32
    Disallow: /

    User-agent: VCI
    Disallow: /

    User-agent: Szukacz/1.4
    Disallow: /

    User-agent: QueryN Metasearch
    Disallow: /

    User-agent: Openfind data gathere
    Disallow: /

    User-agent: Openfind
    Disallow: /

    User-agent: Xenu's Link Sleuth 1.1c
    Disallow: /

    User-agent: Xenu's
    Disallow: /

    User-agent: Zeus
    Disallow: /

    User-agent: RepoMonkey Bait & Tackle/v1.01
    Disallow: /

    User-agent: RepoMonkey
    Disallow: /

    User-agent: Microsoft URL Control
    Disallow: /

    User-agent: Openbot
    Disallow: /

    User-agent: URL Control
    Disallow: /

    User-agent: Zeus Link Scout
    Disallow: /

    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    Disallow: /

    User-agent: Webster Pro
    Disallow: /

    User-agent: EroCrawler
    Disallow: /

    User-agent: LinkScan/8.1a Unix
    Disallow: /

    User-agent: Keyword Density/0.9
    Disallow: /

    User-agent: Kenjin Spider
    Disallow: /

    User-agent: Iron33/1.0.2
    Disallow: /

    User-agent: Bookmark search tool
    Disallow: /

    User-agent: GetRight/4.2
    Disallow: /

    User-agent: FairAd Client
    Disallow: /

    User-agent: Gaisbot
    Disallow: /

    User-agent: Aqua_Products
    Disallow: /

    User-agent: Radiation Retriever 1.1
    Disallow: /

    User-agent: WebmasterWorld Extractor
    Disallow: /

    User-agent: Flaming AttackBot
    Disallow: /

    User-agent: Oracle Ultra Search
    Disallow: /

    User-agent: MSIECrawler
    Disallow: /

    User-agent: PerMan
    Disallow: /

    User-agent: searchpreview
    Disallow: /

    User-agent: sootle
    Disallow: /

    User-agent: es
    Disallow: /

    User-agent: Enterprise_Search/1.0
    Disallow: /

    User-agent: Enterprise_Search
    Disallow: /
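
    A list like this only works if the crawler actually checks it. As a quick sketch of how a well-behaved bot evaluates these rules, using Python's standard-library parser and just two entries reproduced from the list above:

    ```python
    from urllib import robotparser

    # Two entries reproduced from the full list above
    rules = """\
    User-agent: EmailSiphon
    Disallow: /

    User-agent: Wget
    Disallow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())

    # A listed agent is blocked everywhere...
    print(rp.can_fetch("EmailSiphon", "/page.html"))   # False

    # ...but an agent not listed (with no "User-agent: *" entry) is allowed
    print(rp.can_fetch("SomeRandomBot", "/page.html"))  # True
    ```

    Nothing enforces this client-side check, which is the point raised later in the thread: a scraper that simply never reads robots.txt sees no difference at all.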
     
    angellica2017, Sep 19, 2006 IP
  2. explorer

    #2
    Hi, could you cite a source (or sources) for this list please, or have you compiled it from your own logs? Thanks.
     
    explorer, Sep 19, 2006 IP
  3. un1x

    #3
    Yes, where are you getting all this from?
     
    un1x, Sep 19, 2006 IP
  4. mdvaldosta

    #4
    Wow that's a helluva list. In all seriousness though, if it was a bad bot (scraper) why would it obey a robots.txt?
     
    mdvaldosta, Sep 19, 2006 IP
  5. angellica2017

    #5
    stealing bandwidth?
     
    angellica2017, Sep 23, 2006 IP
  6. pankaj.sharma

    #6
    User-agent: Keyword Density/0.9
    Can you tell me how the above user agent affects a site?
     
    pankaj.sharma, Jan 1, 2008 IP
  7. intruth

    #7
    So just copy and paste that in robots.txt???
     
    intruth, Jan 2, 2008 IP
  8. agentvic

    #8
    I don't know about all of them, but many I've seen are useless bandwidth hogs.
     
    agentvic, Jan 3, 2008 IP
  9. Ladadadada

    #9
    Wouldn't it just be easier to
    
    User-agent: Googlebot
    Allow: /
    
    User-agent: Slurp
    Allow: /
    
    User-agent: *
    Disallow: /
    I can't remember whether you need the Disallow first or last but that would make your robots.txt file a lot smaller... which would save on bandwidth.
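
    For what it's worth, group order shouldn't matter: a robot is supposed to use the most specific matching User-agent group and fall back to * only when nothing else matches. A quick check with Python's standard-library parser (a sketch; the agent names are just examples):

    ```python
    from urllib import robotparser

    # Allowlist-style rules: one named engine allowed, everyone else disallowed
    rules = """\
    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())

    # Googlebot matches its own group; other agents fall through to "*"
    print(rp.can_fetch("Googlebot", "/page.html"))    # True
    print(rp.can_fetch("SomeScraper", "/page.html"))  # False
    ```

    Allow support varied between engines at the time, so treat this as a sketch of the parsing model rather than a guarantee about any particular crawler.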
     
    Ladadadada, Jan 6, 2008 IP
  10. SwapsRulez

    #10
    Wow.... very nice list of all bad bots.... ;)
     
    SwapsRulez, Jan 12, 2008 IP
  11. caj

    #11
    caj, Jan 13, 2008 IP
  12. nealrm

    #12
    I don't understand this post. Other than grabbing a small amount of bandwidth, what can these crawlers do?
     
    nealrm, Jan 13, 2008 IP
  13. SwapsRulez

    #13
    Just add the whole list to your /robots.txt; it takes effect the next time a bot checks the file.


    Hi mate, this post is meant for the bad robots that harvest e-mail addresses from your site and use them for spamming. They can also steal your bandwidth by repeatedly downloading or crawling your pages. Even though bad bots can ignore /robots.txt, it's always safer to take care of them. :cool: Peace! ;)
     
    SwapsRulez, Jan 13, 2008 IP
  14. xaralee

    #14
    So... do those two things above have the same purpose? If so, it would be nice to choose the second one, it's much simpler. If not, is there any explanation why?
     
    xaralee, Feb 1, 2008 IP
  15. gpmgroup

    #15
    No, the second one would have a massive impact on traffic from search engines. Depending on how well your site is designed, you could lose 50% of your visitors.

    You need to be very careful playing with the robots.txt file; it's very easy to get your site removed from all the search engines by getting one character wrong.
     
    gpmgroup, Feb 2, 2008 IP
  16. xaralee

    #16
    thank you for your explanation.
    so which one is the best ? the first or the second one ?

    Much appreciate it.
     
    xaralee, Feb 2, 2008 IP
  17. gpmgroup

    #17
    It really depends on what you want to achieve, the nature of your site and how much time you want to spend on it.

    Wildcarding a disallow means that everything not explicitly listed will be excluded. In the above example that would exclude Microsoft's MSN and Live search engines, for example. (It is in fact even more complex, in that some robots read the robots.txt file but ignore wildcarding and require an explicit exclude.)

    The other option posted by angellica2017 is to exclude as many minor robots as possible. The problem here is that the agents keep changing and the naughty guys just ignore the robots.txt file.

    Another use is to exclude the directories you don't want search engines to index (useful for forum sites and duplicate content issues).

    The best way is to check out a few sites

    http://www.digitalpoint.com/robots.txt
    http://www.hp.com/robots.txt

    etc

    Not having a robots.txt file at all means the log files fill up with 404s, so even a simple one can be useful.

    To allow all robots

    User-agent: *
    Disallow:

    To disallow the well behaved robots

    User-agent: *
    Disallow: /

    (Though as I said above, some of the slightly naughty ones only recognize the disallow if you explicitly name them, and the really naughty ones don't care what you put in it :) )
     
    gpmgroup, Feb 2, 2008 IP
  18. Kaizoku

    #18
    Any advanced attacking tool can fake the user-agent.
     
    Kaizoku, Feb 2, 2008 IP
  19. xaralee

    #19
    @gpmgroup
    i see....thank you for the details :)
     
    xaralee, Feb 4, 2008 IP
  20. Tom Strong

    #20
    Fixed.

    The dumber spiders don't follow robots.txt, although clever ones will try to follow some rules to avoid getting caught in spider traps, which most webmasters don't have anyway. You can mimic this little robots.txt by writing some lines in your .htaccess file. You can even allow only known spiders from a fixed list of known IPs.

    BTW: The Allow directive is not officially supported by the big search engines.

    A fixed list of user agents is a no-no.
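
    A minimal sketch of that .htaccess approach, assuming Apache 2.2 with mod_setenvif enabled; the agent substrings here are just examples picked from the list earlier in the thread:

    ```apache
    # Tag any request whose User-Agent contains a known bad-bot substring
    SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "WebStripper" bad_bot
    SetEnvIfNoCase User-Agent "SiteSnagger" bad_bot

    # Refuse tagged requests with a 403, whether or not they read robots.txt
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    ```

    Unlike robots.txt, this is enforced server-side, but as pointed out above, a scraper can still fake its User-Agent, so an IP allowlist is the only hard stop.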
     
    Tom Strong, Feb 5, 2008 IP