Please help me understand the robots.txt file

Discussion in 'robots.txt' started by MyArtGallery, Sep 6, 2011.

  1. #1
    Hi

    If I add the following robots.txt file, will Google exclude all URLs starting with index.php inside "dir1", or only the file index.php inside the "dir1" directory?

    User-agent: *
    Disallow: /dir1/index.php
    Allow: /

    I mean, will this disallow only the URL www.site/dir1/index.php, or will Google also disallow all URLs that start with index.php inside the "dir1" directory? For example: www.site/dir1/index.php-product_IDx.html

    thanks
     
    MyArtGallery, Sep 6, 2011 IP
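
Editor's note: robots.txt rules are matched as path prefixes, so the behaviour asked about above can be checked locally. The following is a minimal sketch using Python's standard `urllib.robotparser`; the host `www.example.com` and the helper name `is_allowed` are assumptions made for illustration. Python's parser applies the first matching rule, while Google documents a most-specific-match policy; for this particular file both approaches should give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dir1/index.php
Allow: /
"""


def is_allowed(path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # The host is a placeholder; only the path is compared against the rules.
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in (
        "/dir1/index.php",
        "/dir1/index.php-product_IDx.html",
        "/dir1/other.html",
        "/index.php",
    ):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

In this sketch, both `/dir1/index.php` and `/dir1/index.php-product_IDx.html` come back as blocked, because the Disallow value is a prefix of each path.
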
  2. prince@life

    #2
    I'm no expert in this either, but I have now read up on it to answer you.
    Use this for your robots.txt:

    User-agent: *
    Disallow: /dir1/index.php*
    Allow: /


    The * symbol means the rule will match index.php itself as well as all URLs that start with index.php, such as index.php-abc-xyz or index.php-abc-456-hght.html.
     
    prince@life, Sep 12, 2011 IP
  3. christopherscott

    #3
    It will disallow crawlers from crawling URLs that start with index.php and allow all other pages to be crawled.
     
    christopherscott, Sep 15, 2011 IP
  4. jabz.biz

    #4
    Unfortunately this is complete nonsense. The wildcard (*) on User-agent affects all bots obeying the robots.txt protocol.
    There is no wildcard for the Disallow directive!

    Learn about robots.txt and the robots exclusion standard here: http://rield.com/cheat-sheets/robots-exclusion-standard-protocol

    Also, if you, MyArtGallery, want to stop the index.php files from appearing in Google or to users, you should write a rewrite rule for that. Using robots.txt to tackle that issue will not be very effective. What you want to do is canonicalize your directory index. I wrote a how-to on that, too:

    http://rield.com/how-to/directoryindex-canonicalization

    All examples work for index.php files as well. Just change .html to .php in both lines of your .htaccess file.

    I hope this helps to solve your initial problem.
     
    jabz.biz, Sep 26, 2011 IP
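
Editor's note: the post above recommends canonicalizing the directory index with a server-side rewrite rule. The linked how-to is not reproduced here, so the following is only a rough sketch of the URL mapping such a redirect would perform, written in Python rather than as an .htaccess rule. The function name `canonical_url` and the list of index file names are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed directory index file names; adjust to match the server's setup.
INDEX_NAMES = ("index.php", "index.html")


def canonical_url(url: str) -> str:
    """Map .../dir/index.php to .../dir/ and leave other URLs unchanged."""
    parts = urlsplit(url)
    # Split the path into everything before the last slash and the file name.
    head, _, last = parts.path.rpartition("/")
    if last in INDEX_NAMES:
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(canonical_url("http://www.example.com/dir1/index.php"))
    print(canonical_url("http://www.example.com/dir1/index.php-product_IDx.html"))
```

Only an exact index file name is rewritten, so a URL such as `/dir1/index.php-product_IDx.html` is left alone. On a live site the same mapping would typically be served as a permanent redirect.
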
  5. FOREXSIGNAL

    #5
    Thanks for sharing.
     
    FOREXSIGNAL, Sep 28, 2011 IP
  6. amherstsowell

    #6
    It will stop crawlers from crawling all files that have index.php in the URL.
     
    amherstsowell, Sep 29, 2011 IP
  7. ChildrensEntertainer

    #7
    I put up a robots.txt file with a * for PDF files. When I checked with my host, they said it wouldn't work because you have to specify each file individually, so it seems there are differing opinions on this. So far, using * hasn't worked, because the file is still indexed.
     
    ChildrensEntertainer, Nov 7, 2011 IP
  8. pr0t0n

    #8
    It takes some time to deindex an already indexed URL. Make sure to give search engines enough time to update and apply your robots.txt directive. According to Google help, to block all .pdf files you need something like this:

    
    User-agent: *
    Disallow: /*.pdf$
    
    Quote from Google webmasters help:
     
    pr0t0n, Nov 8, 2011 IP
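
Editor's note: Python's standard `urllib.robotparser` does not interpret `*` or `$` in rule paths, so the Google-style pattern quoted above cannot be tested with it directly. The following is a minimal sketch of how such a pattern could be matched, assuming `*` means "any run of characters" and a trailing `$` means "end of the URL path". The function names `rule_to_regex` and `rule_matches` are hypothetical, and this is not Google's actual implementation.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    # Without "$" the rule is a prefix match, so the end is left open.
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if the rule pattern matches the start of url_path."""
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    print(rule_matches("/*.pdf$", "/docs/file.pdf"))
    print(rule_matches("/*.pdf$", "/docs/file.pdf?download=1"))
    print(rule_matches("/dir1/index.php*", "/dir1/index.php-product_IDx.html"))
    print(rule_matches("/dir1/index.php", "/dir1/index.php-product_IDx.html"))
```

Under these assumptions, `/dir1/index.php*` and `/dir1/index.php` match the same set of paths, since a rule without `$` is already a prefix match.
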
  9. pr0t0n

    #9
    For Googlebot, yes, there is a wildcard for the Disallow directive. For other bots I can't say for sure.

    http://www.google.com/support/webmasters/bin/answer.py?answer=156449

     
    pr0t0n, Nov 8, 2011 IP
  10. pr0t0n

    #10
    Since you asked about Google specifically (62 days ago...), according to their webmaster help pages the rule should be, as prince@life previously mentioned:

    
    User-agent: *
    Disallow: /dir1/index.php*
    
    I haven't tested it, but their help pages say so. I posted the link to the Google help page in my previous post.
     
    Last edited: Nov 8, 2011
    pr0t0n, Nov 8, 2011 IP