Some general reading: http://en.wikipedia.org/wiki/Robots.txt

Essentially, it's a text file in the root of your domain that disallows/permits specific bots from accessing specific directories of your website. For example:

User-agent: *
Disallow: /

disallows all bots from crawling (and so indexing) any of your site, while

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /

would allow Google full access but disallow all the others.
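The same prefix syntax works for a single directory rather than the whole site; a minimal sketch, where /private/ is just a placeholder name for whatever folder you want kept out:

User-agent: *
Disallow: /private/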
I have one question. Let's say I have hidden folders in my root directory and simply don't want SEs to spider and index them, i.e. I sell an ebook or software that is stored in a directory and I don't want it to show up in SE results. Now some guys are not dumb and would try the following: type in www.mydomains.com/robots.txt, see which directories are "hidden", and easily access the files I wanted to protect. What I do is store the thank-you/download page on another domain, but nonetheless someone might just scan through hundreds or thousands of domains blindly and get lucky and find some valuable content somewhere. How do you protect against this? Can't you make it so that someone who types the robots.txt URL directly into the address bar either gets an error page or gets redirected somewhere else? Thanks for your input and ideas!
I've been wondering about this too, actually. One thing you can do to protect your disallowed folder is to make sure that anyone who types in mydomain.com/myhiddenfolder/ won't see the contents of the directory. This can be done either by placing an empty index.html file in it, or by setting the server configuration to disallow directory listings entirely. You may want to take a look at .htaccess files for that. To protect your ebooks and whatnot, you may want to look into some third-party PHP/ASP scripts; there should be some that can secure your content. Another possibility for protecting the robots.txt file is to have files with the .txt extension resolve to a PHP/Perl script -- similar to how some sites have .html files parsed as PHP files. Then you can look at the user-agent and, if it's a bot, display robots.txt; otherwise, display something else (a rough sketch is below). ...But I'm not sure if Google would consider that cloaking. I hope this helps, Michael
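A minimal sketch of that last idea, assuming Apache with mod_rewrite and PHP; robots.php and /myhiddenfolder/ are placeholder names, and the user-agent check is easily spoofed, so treat this as an illustration rather than real protection:

.htaccess:

# Turn off directory listings so /myhiddenfolder/ shows nothing by itself
Options -Indexes

# Route requests for robots.txt through a script (requires mod_rewrite)
RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]

robots.php:

<?php
// Serve the real rules only to known crawlers; everyone else sees an
// empty ruleset, so the protected directory name is never revealed.
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Very rough bot detection by user-agent substring (hypothetical list)
$isBot = (bool) preg_match('/googlebot|bingbot|slurp|duckduckbot/i', $ua);

if ($isBot) {
    echo "User-agent: *\n";
    echo "Disallow: /myhiddenfolder/\n";   // placeholder protected directory
} else {
    echo "User-agent: *\n";
    echo "Disallow:\n";                    // reveal nothing to casual visitors
}
?>

Whether serving different robots.txt content to bots and humans counts as cloaking is the open question mentioned above, so this is offered only as an example of the mechanism.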
So is this what you would do if you had an images directory that you wouldn't want google to look at individually?
When I want to block certain files or directories with robots.txt but don't want others to see the filenames, I use wildcards and truncated prefixes in robots.txt.

From Google: to block all URLs that have a ? in them:

Disallow: /*?*

From me: to block /admin/login.php, do:

Disallow: /adm
Disallow: /adm*

To block all files in the /images/ directory:

Disallow: /imag
Disallow: /imag*
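Put together, a sketch of such a file might look like this (assuming the real paths are /admin/ and /images/). Standard Disallow rules are plain prefix matches, so the truncated form alone already covers them; the * wildcard is an extension honoured by Googlebot and some other major crawlers:

User-agent: *
Disallow: /*?*
Disallow: /adm
Disallow: /imag

The trade-off is that a prefix like /adm also blocks any other path starting with those letters, so pick a prefix that only matches what you actually want hidden.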