Hey there, if I want to block Google (and all other search engines) from a particular directory, can I do this? Let's say I want the www.example.com/widgets directory to be indexed, and also www.example.com/blue, but not www.example.com/blue/widgets. Is this acceptable?
    User-agent: *
    Disallow: /blue/widgets/
The reason I ask is that I can't find an example that has two directories together, and I don't want to block www.example.com/widgets or www.example.com/blue at all, just the combination. Will this work? Thanks. Note: Rep to whoever helps me
As far as I can tell, this should work. It should allow anything BUT the /blue/widgets/ folder. As a double check, you could always generate a sitemap, e.g. with auditmypc.com, and check which pages it reads and which it lists.
Your rule
    User-agent: *
    Disallow: /blue/widgets/
is correct. It will not affect the /blue directory. It tells bots "do not crawl whatever is in the /blue/widgets/ folder". But you seem to misunderstand the robots file. It cannot "block" a crawler. It is only a request: well-behaved crawlers such as Googlebot honor it and skip those pages, but a crawler that ignores robots.txt can still access the pages and analyze their content. If you have stuff in the folder that you don't want any crawler to even see, you need to protect the folder. /*tom*/
Thanks tom. I don't mind them accessing the folder, I just don't want it indexed. It's a virtual folder anyway, so there's no way to protect it (if I needed to). I still want the example.com/widgets folder to be indexed, just not example.com/blue/widgets. Thanks.
Your example code looks fine. In the future, you might want to check out the robots.txt tools in Google's Webmaster Tools. They let you test robots.txt code to see if it works the way you want. Just in case you weren't aware of this, note that blocking URLs in your robots.txt will not remove any URLs that are already in the index. It just prevents crawling. If this situation arises for you again, the best course is to add a robots <meta> tag set to "noindex" on any page that you don't want indexed. If the page is already indexed, AND you use this <meta> tag and allow the page to be crawled in your robots.txt file (the crawler has to fetch the page to see the tag), it will be removed from the index once it is crawled again.
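If you'd rather sanity-check the rule locally before uploading it, here is a minimal sketch using Python's standard-library urllib.robotparser. The names ROBOTS_TXT and is_allowed are made up for this example, and "Googlebot" is just a sample user agent. Python's parser does plain prefix matching, so treat this as a rough check rather than a guarantee of how any particular search engine will behave.

```python
from urllib.robotparser import RobotFileParser

# The rule from this thread, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blue/widgets/
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the rules above permit user_agent to fetch path.

    Hypothetical helper for illustration; the host name is the example
    domain used in the question.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ["/widgets/", "/blue/", "/blue/widgets/",
                 "/blue/widgets/page.html", "/blue/widgets"]:
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

One thing this shows: with the trailing slash in the rule, /blue/widgets/ and everything under it is disallowed, but the bare URL /blue/widgets (no trailing slash) is not matched by the prefix. If that URL is reachable on your site, you may want Disallow: /blue/widgets instead, keeping in mind that it would also match paths such as /blue/widgets-old.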