Robots.txt code to disallow by default but allow domain name

Discussion in 'robots.txt' started by Steviebone, Mar 4, 2015.

  1. #1
    I am trying to reconfigure a robots.txt file. I know this approach may be frowned upon, but I want to exclude everything except certain specified directories (instead of allowing everything except certain paths/files).

    Consider this block:

    User-agent: *
    Disallow: /
    Allow: /Dir1/
    Allow: /Dir2/
    Allow: /Dir3/
    Allow: /Dir4/
    
    This works except for one fatal flaw: it blocks the default home page reached by the domain name alone, such as:

    www.domainname.com

    Since the 'index.htm' (or whatever default file the web server returns) is implied and not explicit, the rule fails for the domain name by itself. I don't care much for the idea of allowing everything by default and then having to hunt down everything I don't want indexed/crawled. Whoever came up with this idea was creating crawlers.

    I know you can allow subdirs after a disallow statement but how then can you handle anything in the root? Hell, that's the one place I want to limit. It seems like it would be much simpler to be able to just list areas of a site you want crawled, not the other way around. Am I crazy? Or is this just stupid?

    Any workarounds I can't see?
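
For illustration, here is a minimal Python sketch of how a crawler that uses longest-match precedence (the behaviour described in RFC 9309 and in Google's robots.txt documentation) would evaluate the block above. It also tries one possible workaround: adding `Allow: /$`, where `$` anchors the end of the path. The helper names `pattern_to_regex` and `is_allowed` are hypothetical, the matcher is a simplification, and not every crawler honors `Allow`, `*` or `$`, so treat this as a way to reason about the rules rather than a guarantee of crawler behaviour.

```python
import re

def pattern_to_regex(pattern):
    # Translate a robots.txt path pattern: '*' matches any run of characters,
    # a trailing '$' anchors the end of the path, everything else is literal.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(rules, path):
    # rules: list of (directive, pattern) pairs for a single user-agent group.
    # The longest matching pattern wins; on a tie, Allow beats Disallow.
    best_len, allowed = -1, True  # a path matched by no rule is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The block from the post.
original = [
    ("Disallow", "/"),
    ("Allow", "/Dir1/"),
    ("Allow", "/Dir2/"),
    ("Allow", "/Dir3/"),
    ("Allow", "/Dir4/"),
]

# Assumed workaround: allow only the bare root path.
with_root = original + [("Allow", "/$")]

for label, rules in (("original", original), ("with Allow: /$", with_root)):
    for path in ("/", "/Dir1/page.htm", "/private.htm"):
        print(label, path, is_allowed(rules, path))
```

Under these assumptions the bare `/` is blocked by the original block, because `Disallow: /` is the only rule that matches it. With `Allow: /$` added, `/` becomes crawlable while other files in the root, such as `/private.htm`, stay blocked.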
     
    Steviebone, Mar 4, 2015