Correct Way To Use Allow & Disallow In Robots.txt?

Discussion in 'Search Engine Optimization' started by jvfconsulting, Mar 29, 2010.

  1. #1
    What is the correct way to allow spiders to crawl a specific folder inside a directory you have disallowed? I have two candidate robots.txt files, but I'm not sure which one will work. Please help!

    ROBOTS.TXT #1
    Sitemap: http://www.jvfconsulting.com/sitemap.xml
    User-agent: *
    Allow: /amass/images/*?$
    Disallow: /amass
    Disallow: /customwebsitedesignjvf
    Disallow: /cgi-bin

    ROBOTS.TXT #2
    Sitemap: http://www.jvfconsulting.com/sitemap.xml
    User-agent: *
    Allow: /amass/images/
    Disallow: /amass
    Disallow: /customwebsitedesignjvf
    Disallow: /cgi-bin
     
    jvfconsulting, Mar 29, 2010
  2. Montreal Classifieds

    #2
    Use Google Webmaster Tools -> Site Configuration -> Crawler access.
    You can modify your robots.txt on the fly there and test different pages of your website against the different Google robots.
    I strongly recommend testing it there. Any mistake can keep Google from indexing parts of your website, and I'm sure that is the last thing you want (unless you really do want it :)).
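
For a quick local check alongside the Webmaster Tools tester, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are ROBOTS.TXT #2 from the first post; the sample paths other than the image path quoted later in the thread are hypothetical. One caveat, as far as I know: this parser applies rules in the order they appear in the file, which is not necessarily how every search engine resolves conflicts, so treat the result as a sanity check rather than a prediction of Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from "ROBOTS.TXT #2" in the first post.
ROBOTS_TXT = """\
Sitemap: http://www.jvfconsulting.com/sitemap.xml
User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin
"""


def check_paths(robots_txt, paths, user_agent="*"):
    """Return {path: bool} saying whether user_agent may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


if __name__ == "__main__":
    # Hypothetical test paths, except the image path quoted later in the thread.
    sample_paths = [
        "/amass/images/product/1/amass_blog_adv.jpg",
        "/amass/index.php",
        "/cgi-bin/form.pl",
        "/about.html",
    ]
    for path, allowed in check_paths(ROBOTS_TXT, sample_paths).items():
        print("allow" if allowed else "block", path)
```

A path that matches no rule at all (here `/about.html`) comes back as allowed, which is the default behavior of robots.txt.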
     
    Montreal Classifieds, Mar 29, 2010
  3. DoA

    #3
    Don't worry about the Allow statement; anything that isn't explicitly disallowed is allowed by default.

     
    DoA, Mar 29, 2010
  4. jvfconsulting

    #4
    If you look at my robots.txt above, you'll see I'm disallowing spiders from crawling our content management system with "Disallow: /amass", but we still want spiders to crawl the images inside it. That's why I asked whether the line "Allow: /amass/images/*?$" is the correct way to get spiders to crawl images within a CMS whose main directory is disallowed. Does that make sense? I hope someone can help. Google Webmaster Tools has been no help so far: it has been 3 days since they last crawled my robots.txt file.
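
On the pattern itself, a rough sketch of how "Allow: /amass/images/*?$" would be read under the wildcard rules Google documents, where `*` matches any run of characters, a trailing `$` marks the end of the URL, and every other character (including `?`) is literal. The helper names `rule_to_regex` and `rule_matches` are made up for this illustration, and other crawlers may interpret wildcards differently or not at all.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: Google-style wildcards only. '*' matches any run of
    characters, a trailing '$' anchors the end of the URL, and every
    other character (including '?') is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern, path):
    """True if the rule pattern matches the URL path from its start."""
    return rule_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    image = "/amass/images/product/1/amass_blog_adv.jpg"
    print(rule_matches("/amass/images/*?$", image))        # False
    print(rule_matches("/amass/images/*?$", image + "?"))  # True
    print(rule_matches("/amass/images/", image))           # True
```

Read this way, the pattern from ROBOTS.TXT #1 only matches URLs that end in a literal question mark, so it would not cover ordinary image URLs, while the plain prefix from ROBOTS.TXT #2 does.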
     
    jvfconsulting, Mar 31, 2010
  5. social-media

    #5
    To my knowledge, you cannot allow a subfolder of a folder that has been disallowed. At best, it will depend on the search engine and how it interprets the robots.txt standard.

    If you want to achieve the above result, don't use robots.txt at all. Instead, put <meta name="robots" content="noindex"> on the pages you do NOT want crawled/indexed. This is actually better than using robots.txt anyway, because it absolutely prevents Google from showing your page in the SERPs. When you use robots.txt to keep crawlers out of a folder or page, that page can STILL be shown in Google's SERPs if it has inbound links pointing to it: the link text lets Google infer what the page is about, and Google may decide it is relevant to the search query. A page that has <meta name="robots" content="noindex"> will not be shown in the Google SERPs under any circumstances. And if the tag is added to a page that is already indexed, it will cause Google to remove the page from its index (unlike robots.txt).
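
To confirm a page really carries the tag, a small sketch using Python's standard `html.parser` can help. The class name `RobotsMetaParser`, the function `has_noindex`, and the sample HTML are invented for this example. Keep in mind that a crawler can only see the tag on pages it is allowed to fetch, so this approach assumes the page is not also blocked in robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Hypothetical page source, for illustration only.
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex(page))  # True
```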
     
    social-media, Mar 31, 2010
  6. jvfconsulting

    #6
    Thanks for your help, social-media! I was afraid that you couldn't allow a specific folder inside one that has been disallowed. I would really like to have our image folder crawled, which sits inside our CMS folder, but it sounds like we may be stuck.
     
    jvfconsulting, Mar 31, 2010
  7. jvfconsulting

    #7
    jvfconsulting, May 10, 2010
  8. glassblock

    #8
    Thanks, social-media, very interesting advice.
     
    glassblock, May 11, 2010
  9. jvfconsulting

    #9
    You should not follow this advice! As you can see, our robots.txt is letting Google index our images using an "Allow: main/subfolder" rule. See for yourself!
    This image is indexed in Google even though it sits in a subfolder of a folder that has been disallowed: http://www.jvfconsulting.com/amass/images/product/1/amass_blog_adv.jpg
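
A likely explanation for why this works: Google documents that when an Allow and a Disallow rule both match a URL, the most specific (longest) rule wins, and the less restrictive rule wins a tie. Below is a minimal sketch of that precedence, assuming plain path prefixes without wildcards and using the rules from ROBOTS.TXT #2 in the first post. The function name `is_allowed` and the non-image sample paths are hypothetical, and this is not a full robots.txt parser.

```python
# Rules from "ROBOTS.TXT #2", as (directive, path prefix) pairs.
RULES = [
    ("allow", "/amass/images/"),
    ("disallow", "/amass"),
    ("disallow", "/customwebsitedesignjvf"),
    ("disallow", "/cgi-bin"),
]


def is_allowed(rules, path):
    """Apply longest-match precedence to plain prefix rules.

    The longest matching prefix decides the outcome. If an Allow and a
    Disallow rule of equal length both match, Allow wins. A path that
    matches no rule is allowed by default.
    """
    best = None  # (prefix length, is_allow) of the strongest match so far
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    print(is_allowed(RULES, "/amass/images/product/1/amass_blog_adv.jpg"))  # True
    print(is_allowed(RULES, "/amass/admin/login.php"))  # False
    print(is_allowed(RULES, "/index.html"))  # True
```

Because "/amass/images/" is longer than "/amass", the Allow rule takes precedence for anything under the images folder, while the rest of /amass stays blocked.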
     
    jvfconsulting, May 12, 2010