When I use the "site:" command to query MSN for pages in my domain, I see pages that are disallowed in my robots.txt file. Is it just me, or is MSN ignoring robots.txt files?
Disallow: in robots.txt means "do not visit this page". It does not mean "do not index it". Pages are regularly indexed by Google, Yahoo, MSN and others without ever having been visited by a robot, and this is in compliance with the Robots Exclusion Protocol.
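For illustration, a minimal sketch of such a rule (the "/blue-horse.html" path is just an example) tells compliant robots not to fetch that page, but says nothing about whether its URL may be listed in an index:

    User-agent: *
    Disallow: /blue-horse.html

Jean-Luc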
I'm familiar with that aspect of the Robots Exclusion Protocol ["The value of this field specifies a partial URL that is not to be visited..."]; however, I wasn't aware that a URL could be indexed without being visited/retrieved. Hmmm...
A disallowed page can be indexed because of the information collected by the robot on other allowed pages that contain links pointing to the disallowed page. The address of a disallowed page can appear in the SERPs, but there will be no cached version of the page. For example, if page "/blue-horse.html" is disallowed, other pages might contain links like this:

    <a href="/blue-horse.html">Blue Horse</a>

That's enough for a search engine to index "/blue-horse.html" and to show it in some SERPs. Jean-Luc