What is the correct way to allow spiders to crawl a specific folder inside a directory you have disallowed? I have two robots.txt files but I'm not sure which one will work. Please help!

ROBOTS.TXT #1

Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/*?$
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin

ROBOTS.TXT #2

Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin
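For what it's worth, you can sanity-check the second file with Python's standard-library robots.txt parser. A caveat: `urllib.robotparser` does not understand Googlebot's wildcard extensions (`*` and `$`), so it can't meaningfully evaluate the Allow line in file #1, and it matches rules in file order rather than by longest path the way Google does. It still agrees with Google on file #2 here, because the Allow rule for /amass/images/ is listed first and is also the more specific match. The URLs below are just example paths:

```python
# Sketch: checking which URLs the second robots.txt permits, using the
# stdlib parser. Note: urllib.robotparser ignores Googlebot's wildcard
# syntax, so this only approximates what Googlebot itself would do.
import urllib.robotparser

robots_txt = """\
Sitemap: http://www.jvfconsulting.com/sitemap.xml

User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /customwebsitedesignjvf
Disallow: /cgi-bin
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# An image under the allowed subfolder: crawlable.
print(rp.can_fetch("*", "http://www.jvfconsulting.com/amass/images/logo.jpg"))
# Anything else under /amass: blocked.
print(rp.can_fetch("*", "http://www.jvfconsulting.com/amass/admin.php"))
```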
Use Google Webmaster Tools -> Site Configuration -> Crawler access. You can modify your robots.txt on the fly and test different pages on your website against different Google robots. I strongly recommend you test it there first. Any mistake can result in Google not indexing parts of your website, and I'm sure that's the last thing you want (unless you really want it).
Don't worry about the Allow statement; anything that isn't explicitly disallowed is allowed by default.
If you look at my robots.txt above, you'll see I'm disallowing spiders from crawling our content management system ("Disallow: /amass"), but we want to allow spiders to crawl only the images within it. That's why I asked whether the line "Allow: /amass/images/*?$" is the correct way to get spiders to crawl images within our CMS when the main directory has a disallow. Does that make sense? I hope someone can help; Google Webmaster Tools is really no help, and it's been 3 days since they last crawled my robots.txt file.
To my knowledge you cannot allow a subfolder of a folder that has been disallowed. At best, it will likely depend on which search engine it is and how it interprets the robots.txt standard. If you want to achieve the above, don't use robots.txt at all. Use <meta name="robots" content="noindex"> instead on the pages you do NOT want crawled/indexed. This is actually better than using robots.txt anyway, because it absolutely prevents Google from showing the page in the SERPs. When you use robots.txt to keep crawlers from crawling a folder or page, that page can STILL be shown in Google's SERPs if it has inbound links pointing to it: Google can infer what the page is about from the link text and may decide it's relevant to a search query. A page with <meta name="robots" content="noindex"> will not be shown in the Google SERPs under any circumstances. And if the tag is added to a page that is already indexed, it will cause Google to remove the page from its index (unlike robots.txt).
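If you go the meta-tag route, the tag belongs in the <head> of every page you want kept out of the index. A minimal sketch (the page title is just a placeholder):

```html
<!-- Placed in the <head> of any page that must never appear in the index. -->
<head>
  <meta name="robots" content="noindex">
  <title>Admin - CMS</title>
</head>
```

One caveat: for the tag to work, the page must NOT also be blocked in robots.txt, since Google has to be able to crawl the page to see the tag.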
Thanks for your help, social-media! I was afraid you couldn't allow a specific folder inside one that has been disallowed. I would really like to have our image folder, which is within our CMS folder, crawled, but it sounds like we may be stuck.
Woo hoo! Our images are beginning to get indexed in Google Images! I guess the edits to our robots.txt file are finally taking effect. http://images.google.com/images?q=site:jvfconsulting.com
You should not follow this advice! As you can see, our robots.txt is allowing Google to index our images using an Allow rule on a subfolder of a disallowed folder. See for yourself! This image is indexed in Google, and it lives in a subfolder of a folder that has been disallowed: http://www.jvfconsulting.com/amass/images/product/1/amass_blog_adv.jpg