Robots.txt code to disallow by default but allow domain name

Discussion in 'robots.txt' started by Steviebone, Mar 4, 2015.

  #1
    I am trying to reconfigure a robots.txt file. I know this approach may be frowned upon, but... I want to exclude everything except certain specified directories (instead of allowing everything except certain paths/files).

    Consider this block:

    User-agent: *
    Disallow: /
    Allow: /Dir1/
    Allow: /Dir2/
    Allow: /Dir3/
    Allow: /Dir4/
    
    Code (markup):
    This works except for one fatal flaw: it blocks the default home page reached via the domain name alone, such as:

    www.domainname.com
    Code (markup):
    Since 'index.htm' (or whatever default file the web server returns) is implied rather than explicit in the URL, the rule fails for the domain name by itself. I don't care much for the idea of allowing everything by default and then having to hunt down everything I don't want indexed/crawled. Whoever came up with this idea was creating crawlers.
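
    To be concrete, as I understand it the crawler only matches these rules against the URL path, so the bare domain is seen as the path "/", which the blanket Disallow catches before any of the Allow lines come into play:

    www.domainname.com/            ->  path "/"            (blocked by Disallow: /)
    www.domainname.com/Dir1/x.htm  ->  path "/Dir1/x.htm"  (allowed by Allow: /Dir1/)
    Code (markup):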

    I know you can allow subdirectories after a disallow statement, but how do you then handle anything in the root? Hell, that's the one place I most want to limit. It seems like it would be much simpler to just list the areas of a site you want crawled, not the other way around. Am I crazy? Or is this just stupid?
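
    The only partial fix I've been able to come up with is an extra Allow line using the $ end-of-URL anchor, something like the sketch below. As far as I know $ is an extension honored by Google and Bing rather than part of the original robots.txt standard, so I'm not sure how widely it works:

    User-agent: *
    Disallow: /
    Allow: /$
    Allow: /Dir1/
    Allow: /Dir2/
    Allow: /Dir3/
    Allow: /Dir4/
    Code (markup):
    That would let the bare domain through (the path is exactly "/") while everything else in the root stays blocked, though an explicit request for /index.htm would presumably still need its own Allow line.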

    Any workarounds I can't see?
     
    Steviebone, Mar 4, 2015