Robots.txt

Discussion in 'Search Engine Optimization' started by elle19570, Oct 13, 2006.

  1. #1
    Hi,

    Is it OK to write robots.txt in the following format? Please guide me.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow:


    If I disallow the following crawlers (bandwidth-eating crawlers) in my robots.txt, will it affect crawling of my site by search engines:

    User-agent: Flashget
    Disallow: /

    User-agent: Offline
    Disallow: /

    User-agent: Teleport
    Disallow: /

    User-agent: Downloader
    Disallow: /

    User-agent: reaper
    Disallow: /

    User-agent: WebZIP
    Disallow: /

    User-agent: Website Quester
    Disallow: /

    User-agent: MSIECrawler
    Disallow: /

    User-agent: FAST-WebCrawler
    Disallow: /

    User-agent: Gulliver
    Disallow: /

    User-agent: WebCapture
    Disallow: /

    User-agent: HTTrack
    Disallow: /

    User-agent: Fetch API Request
    Disallow: /

    User-agent: NetAnts
    Disallow: /

    User-agent: SuperBot
    Disallow: /

    User-agent: WebCopier
    Disallow: /

    User-agent: WebStripper
    Disallow: /

    User-agent: Wget
    Disallow: /

    User-agent: EmailSiphon
    Disallow: /

    User-agent: MSProxy/2.0
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: webbandit
    Disallow: /

    User-agent: MS FrontPage
    Disallow: /
     
    elle19570, Oct 13, 2006 IP
  2. Cryogenius
    #2
    You will only stop those bad crawlers if they bother to check your robots.txt file. For example, I could still use 'wget' on your site to download every webpage to my computer.

    If you are really worried about it, then look into using a .htaccess file to block those user agents.
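    For example, with Apache you could put something like this in .htaccess (just a sketch using mod_setenvif; the user-agent patterns are only examples, so match whatever actually shows up in your logs):

    # Flag requests whose User-Agent matches a known site ripper (example patterns)
    BrowserMatchNoCase "Wget" bad_bot
    BrowserMatchNoCase "HTTrack" bad_bot
    BrowserMatchNoCase "WebZIP" bad_bot
    BrowserMatchNoCase "Teleport" bad_bot

    # Deny flagged requests, allow everyone else (Apache 2.2 syntax)
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    Code (markup):
    Unlike robots.txt, this is enforced by the server, so the crawler does not get to choose whether to obey it.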

    Cryo.
     
    Cryogenius, Oct 13, 2006 IP
  3. Jean-Luc
    #3
    Hi,
    User-agent: *
    Disallow: /cgi-bin/
    Disallow:
    Code (markup):
    This is not correct.

    An empty Disallow: line allows access to all URLs. If you use it, you should not disallow anything else within the same group of directives.

    User-agent: *
    Disallow: /cgi-bin/
    Code (markup):
    This is the correct way to allow access to all URLs except the ones starting with /cgi-bin/.
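    As for the bad crawlers: those per-bot groups will not affect normal search engine crawling, because a group like "User-agent: Wget" only applies to that crawler. A shortened sketch of the whole file (only two of the bad bots shown as examples, list the rest the same way):

    User-agent: *
    Disallow: /cgi-bin/

    User-agent: Wget
    User-agent: HTTrack
    Disallow: /
    Code (markup):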

    Jean-Luc
     
    Jean-Luc, Oct 13, 2006 IP
  4. Pat Gael
    #4
    My robots.txt is like this:

    
    User-agent: *
    User-agent: Googlebot-Image
    User-Agent: Googlebot
    User-agent: Mediapartners-Google/2.1
    User-agent: Mediapartners-Google*
    User-agent: MSNBot
    User-agent: msnbot-NewsBlogs
    User-agent: Slurp
    User-agent: yahoo-mmcrawler
    User-agent: yahoo-blogs/v3.9
    User-agent: Gigabot
    User-agent: ia_archiver
    User-agent: BotRightHere 
    User-agent: larbin 
    User-agent: b2w/0.1 
    User-agent: Copernic 
    User-agent: psbot 
    User-agent: Python-urllib 
    User-agent: NetMechanic 
    User-agent: URL_Spider_Pro 
    User-agent: CherryPicker 
    User-agent: EmailCollector 
    User-agent: EmailSiphon 
    User-agent: WebBandit 
    User-agent: EmailWolf 
    User-agent: ExtractorPro 
    User-agent: CopyRightCheck 
    User-agent: Crescent 
    User-agent: SiteSnagger 
    User-agent: ProWebWalker 
    User-agent: CheeseBot 
    User-agent: LNSpiderguy 
    User-agent: Alexibot 
    User-agent: Teleport 
    User-agent: TeleportPro 
    User-agent: MIIxpc 
    User-agent: Telesoft 
    User-agent: Website Quester 
    User-agent: WebZip 
    User-agent: moget/2.1 
    User-agent: WebZip/4.0 
    User-agent: WebStripper 
    User-agent: WebSauger 
    User-agent: WebCopier 
    User-agent: NetAnts 
    User-agent: Mister PiX 
    User-agent: WebAuto 
    User-agent: TheNomad 
    User-agent: WWW-Collector-E 
    User-agent: RMA 
    User-agent: libWeb/clsHTTP 
    User-agent: asterias 
    User-agent: httplib 
    User-agent: turingos 
    User-agent: spanner 
    User-agent: InfoNaviRobot 
    User-agent: Harvest/1.5 
    User-agent: Bullseye/1.0 
    User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95) 
    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 
    User-agent: CherryPickerSE/1.0 
    User-agent: CherryPickerElite/1.0 
    User-agent: WebBandit/3.50 
    User-agent: NICErsPRO 
    User-agent: DittoSpyder 
    User-agent: Foobot 
    User-agent: SpankBot 
    User-agent: BotALot 
    User-agent: lwp-trivial/1.34 
    User-agent: lwp-trivial 
    User-agent: BunnySlippers 
    User-agent: URLy Warning 
    User-agent: Wget/1.6 
    User-agent: Wget/1.5.3 
    User-agent: Wget 
    User-agent: LinkWalker 
    User-agent: cosmos 
    User-agent: moget 
    User-agent: hloader 
    User-agent: humanlinks 
    User-agent: LinkextractorPro 
    User-agent: Offline Explorer 
    User-agent: Mata Hari 
    User-agent: LexiBot 
    User-agent: Web Image Collector 
    User-agent: The Intraformant 
    User-agent: True_Robot/1.0 
    User-agent: True_Robot 
    User-agent: BlowFish/1.0 
    User-agent: JennyBot 
    User-agent: MIIxpc/4.2 
    User-agent: BuiltBotTough 
    User-agent: ProPowerBot/2.14 
    User-agent: BackDoorBot/1.0 
    User-agent: toCrawl/UrlDispatcher 
    User-agent: suzuran 
    User-agent: TightTwatBot 
    User-agent: VCI WebViewer VCI WebViewer Win32 
    User-agent: VCI 
    User-agent: Szukacz/1.4 
    User-agent: Openfind data gatherer 
    User-agent: Openfind 
    User-agent: Xenu's Link Sleuth 1.1c 
    User-agent: Xenu's 
    User-agent: Zeus 
    User-agent: RepoMonkey Bait & Tackle/v1.01 
    User-agent: RepoMonkey 
    User-agent: Openbot 
    User-agent: URL Control 
    User-agent: Zeus Link Scout 
    User-agent: Zeus 32297 Webster Pro V2.9 Win32 
    User-agent: Webster Pro 
    User-agent: EroCrawler 
    User-agent: LinkScan/8.1a Unix 
    User-agent: Keyword Density/0.9 
    User-agent: Kenjin Spider 
    User-agent: Iron33/1.0.2 
    User-agent: Bookmark search tool 
    User-agent: GetRight/4.2 
    User-agent: FairAd Client 
    User-agent: Gaisbot 
    User-agent: Aqua_Products 
    User-agent: Radiation Retriever 1.1 
    User-agent: Flaming AttackBot 
    User-agent: Curl 
    User-agent: Web Reaper
    User-agent: Firefox
    User-agent: Opera
    User-agent: Netscape
    User-agent: WebVulnCrawl
    User-agent: WebVulnScan
    Disallow: /
    
    Code (markup):
    However, the above is not to be placed in the root directory, but in each of the directories you don't want crawled, together with an .htaccess file like this:

    
    order deny,allow
    deny from all
    
    Code (markup):
    For the root robots.txt, it's advisable not to disclose which directories you are trying to protect, because anyone can find them just by pointing their browser to www.your_domain.ext/robots.txt.
     
    Pat Gael, Oct 13, 2006 IP
  5. Jean-Luc
    #5
    Robots only look at the robots.txt file in the root directory. If you place robots.txt files in other directories, no robot will look at these files.

    On top of that, if your robots.txt file were placed in the root directory, it would disallow all robots everywhere on the site, exactly like this one would:
    User-agent: *
    Disallow: /
    Code (markup):
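    If the intention is to let the good bots crawl normally and block only the bad ones, the bad ones need their own group with Disallow: /. A shortened sketch (only a few names shown; keep the rest of your list in the same groups, and any bot not matched by a group is allowed by default):

    # Search engine bots: allow everything
    User-agent: Googlebot
    User-agent: MSNBot
    User-agent: Slurp
    Disallow:

    # Site rippers and e-mail harvesters: block everything
    User-agent: WebZip
    User-agent: WebCopier
    User-agent: EmailSiphon
    Disallow: /
    Code (markup):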
    Jean-Luc
     
    Jean-Luc, Oct 13, 2006 IP
  6. elle19570
    #6
    Thank you very much for the guidance
     
    elle19570, Oct 16, 2006 IP