Hi, I am building a website in PHP that is totally dynamic, i.e. none of my file names correspond to a particular web page. Rather, they are scripts that generate pages dynamically based on a GET or POST request. Here is an example: say I have a file called aboutus.php, but none of my web pages contain the link http://www.my_web_site.com/aboutus.php. Instead I have something like http://www.my_web_site.com?goto=aboutus. If all my web pages are generated on the fly like this, is robots.txt relevant for my site? One more thing I want to know: does the search engine crawler (or robot, if you prefer) have access to the hard drive? That is, is the whole web directory visible to the crawler, or can it only see the links I provide, starting from some web page? In the case above, even though aboutus.php is not explicitly mentioned anywhere on any of my web pages, will the crawler still be able to know that I have the file aboutus.php in my web directory? If so, then robots.txt seems relevant. Otherwise it doesn't. Thanks a bunch for any feedback on this.
The robots.txt file is also called the robots exclusion file. Its purpose is to tell a spider which directories and files it should NOT crawl. This has nothing to do with whether a site is dynamic or static: the rules are matched against the URLs a spider requests, not against the files on your disk. Whether or not you wish to exclude anything, it is a good idea to have a robots.txt file, because most spiders request it and, if it isn't there, a 404 error is generated that shows up in your error logs.

One word of caution: do not depend on the robots.txt file to absolutely exclude files or directories. The file is only a request, not access control. The major search engine spiders generally honor it, but badly behaved spiders can ignore it, and a disallowed URL can still show up in an index if other pages link to it. Spiders also do not check the robots.txt file every time they visit. Some, such as GoogleBot, only check it "periodically", so a change may take a while to be noticed.

A spider has no special access to your hard drive or your directories. It can only find files by following links to them. On the other hand, there is some evidence that the Google and Yahoo toolbars are used to collect new URLs, which the search engines then send spiders to crawl. I've seen sites with pages that showed up in a search engine index even though there were no links to those pages.

If you want to keep certain pages out of an index, it's a good idea to use the robots meta tag on each of those pages. A spider has to be able to fetch a page to see the tag, so do not also disallow that page in robots.txt. To prevent spiders from indexing a page:

<meta name="robots" content="noindex,nofollow">

If you have your rewrite routines set up properly, a spider should only find files through the URL format you provide in your links. When the spider requests such a URL, it is rewritten the same as any other request for a page.
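To illustrate that robots.txt rules apply to requested URLs rather than to files on disk, here is a minimal sketch using Python's standard urllib.robotparser module. The rules themselves are assumptions made up for this example, not a recommendation for the site in question: the "admin" page name is hypothetical, and the URLs are written with a slash before the query string (http://www.my_web_site.com/?goto=...) so that they have an explicit path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# "/?goto=admin" assumes a page named "admin" that is not in the original post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /aboutus.php
Disallow: /?goto=admin
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
SITE = "http://www.my_web_site.com"

# Each rule is a prefix of the URL path (plus query string) as requested.
# The parser never looks at which files exist on the server.
for url in (
    SITE + "/aboutus.php",      # direct request for the script
    SITE + "/?goto=aboutus",    # dynamic URL for the same content
    SITE + "/?goto=admin",      # hypothetical dynamic URL that is disallowed
):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(f"{verdict:8} {url}")
```

Under these assumed rules, the direct script URL and the hypothetical admin URL should be reported as blocked, while the ?goto=aboutus URL stays allowed. A polite spider performs this kind of check before requesting a URL; nothing in it stops a spider that chooses to skip the check.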
Thanks for the detailed feedback. But I am wondering: if a spider has no access to the hard drive contents, how can it ever find files that no web page links to? In that case robots.txt seems to serve no purpose at all. After all, it can reasonably be assumed that a webmaster will not provide explicit links to content that he doesn't want others to access in the first place, which in turn means that the crawler (or spider) will have no idea that such files exist, because it cannot access the hard drive and find them. Am I making any sense here? So in my original example, how will it find that there is a file called aboutus.php, since there are no explicit links to that file? Oh, is that where you mention the case of the Google and Yahoo toolbars collecting URLs, where a user might type a URL (for which no links exist anywhere on the web) and the search engine then crawls it, even though the webmaster never intended those pages to be crawled? But this seems like a rare scenario to me.