What is the correct way to allow spiders to crawl a specific folder inside a directory you have disallowed? I have two robots.txt files but I'm not sure which one will work. Please help!

ROBOTS.TXT #1

Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/*?$
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin

ROBOTS.TXT #2

Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin
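For what it's worth, you can sanity-check the second file with Python's standard-library robots.txt parser. A caveat: `urllib.robotparser` does not understand Googlebot's wildcard extensions (`*` and `$`), so it can't meaningfully evaluate the Allow line in file #1, and it matches rules in file order rather than by longest path the way Google does. It still agrees with Google on file #2 here, because the Allow rule for /amass/images/ is listed first and is also the more specific match. The URLs below are just example paths:

```python
# Sketch: checking which URLs the second robots.txt permits, using the
# stdlib parser. Note: urllib.robotparser ignores Googlebot's wildcard
# syntax, so this only approximates what Googlebot itself would do.
import urllib.robotparser

robots_txt = """\
Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# An image under the allowed subfolder: crawlable.
print(rp.can_fetch("*", "http://www.jvfconsulting.com/amass/images/logo.jpg"))
# Anything else under /amass: blocked.
print(rp.can_fetch("*", "http://www.jvfconsulting.com/amass/admin.php"))
```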
Use Google Webmaster Tools -> Site Configuration -> Crawler access. You can modify your robots.txt on the fly and test different pages on your website against different Google robots. I strongly recommend you test it there first. Any mistake can result in Google not indexing parts of your website, and I'm sure that's the last thing you want (unless you really want it).
Don't worry about the Allow statement; anything that isn't explicitly disallowed is allowed by default.
If you look at my robots.txt above, you'll see I'm disallowing spiders from crawling our content management system ("Disallow: /amass"), but we want to allow spiders to crawl only the images within it. That's why I asked whether the line "Allow: /amass/images/*?$" is the correct way to get spiders to crawl images within our CMS when the main directory has a disallow. Does that make sense? I hope someone can help; Google Webmaster Tools is really no help, and it's been 3 days since they last crawled my robots.txt file.
To my knowledge you cannot allow a subfolder of a folder that has been disallowed. At best, it will likely depend on which search engine it is and how it interprets the robots.txt standard. If you want to achieve the above, don't use robots.txt at all. Use <meta name="robots" content="noindex"> instead on the pages you do NOT want crawled/indexed. This is actually better than using robots.txt anyway, because it absolutely prevents Google from showing the page in the SERPs. When you use robots.txt to keep crawlers from crawling a folder or page, that page can STILL be shown in Google's SERPs if it has inbound links pointing to it: Google can infer what the page is about from the link text and may decide it's relevant to a search query. A page with <meta name="robots" content="noindex"> will not be shown in the Google SERPs under any circumstances. And if the tag is added to a page that is already indexed, it will cause Google to remove the page from its index (unlike robots.txt).
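If you go the meta-tag route, the tag belongs in the <head> of every page you want kept out of the index. A minimal sketch (the page title is just a placeholder):

```html
<!-- Placed in the <head> of any page that must never appear in the index. -->
<head>
  <meta name="robots" content="noindex">
  <title>Admin - CMS</title>
</head>
```

One caveat: for the tag to work, the page must NOT also be blocked in robots.txt, since Google has to be able to crawl the page to see the tag.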
Thanks for your help, social-media! I was afraid you couldn't allow a specific folder inside one that has been disallowed. I would really like to have our image folder, which is within our CMS folder, crawled, but it sounds like we may be stuck.
Woo hoo! Our images are beginning to get indexed in Google Images! I guess the edits to our robots.txt file are finally taking effect. http://images.google.com/images?q=site:jvfconsulting.com
You should not follow this advice! As you can see, our robots.txt is allowing Google to index our images using an Allow rule on a subfolder of a disallowed folder. See for yourself! This image is indexed in Google, and it lives in a subfolder of a folder that has been disallowed: http://www.jvfconsulting.com/amass/images/product/1/amass_blog_adv.jpg