Any way to exclude include files from spiders?


TheConnollyKid
Jan 27th 2006, 9:48 am
I am working on a site and have various navigation sets being pulled in dynamically via a PHP include command. I tried using a robots.txt file to exclude the includes folder that contains all of the snippets that get pulled in, but it still looks like the navigation is getting indexed by the spider. Any ideas?

mcfox
Jan 27th 2006, 10:52 am
I don't see how you can exclude the navigation from the spider since it gets served when the spider 'views' the page. Why do you want to exclude the navigation anyway?
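
To illustrate the point, here is a minimal sketch using Python's standard urllib.robotparser. The domain, the /includes/ folder name and the page name are assumptions made up for the example, not details from the site in question. It suggests that a Disallow rule on the includes folder only blocks direct requests for the snippet files; the page that pulls them in is still crawlable, and by then the navigation is already part of that page's HTML.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the folder holding the PHP snippets.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /includes/",
])

# Hypothetical URLs: the raw snippet, and a page that includes it server-side.
snippet_url = "http://www.example.com/includes/nav.php"
page_url = "http://www.example.com/products.php"

snippet_allowed = rules.can_fetch("*", snippet_url)
page_allowed = rules.can_fetch("*", page_url)

# The snippet itself is off limits, but the finished page is not, and the
# spider only ever sees the finished page with the navigation merged in.
print("snippet crawlable:", snippet_allowed)
print("page crawlable:", page_allowed)
```

In other words, robots.txt works on URLs, not on pieces of a page, so it cannot hide part of a document that is otherwise allowed.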

TheConnollyKid
Jan 27th 2006, 11:05 am
The reason I'd like to exclude the navigation is this: say someone is searching for a term that refers to a specific product or service, and the main info for it is on a single page. If that term also appears as a link in the navigation of 50 other pages that have nothing contextually to do with the product or service being sought, then the engine returns 51 results, with the other 50 showing the navigation text where the term is used.

wrmineo
Jan 27th 2006, 11:07 am
Another thing to consider is that, unfortunately, not all spiders will abide by the robots.txt file :(

digitalpoint
Jan 27th 2006, 12:05 pm
There is no Allow: directive in the original robots.txt standard, only Disallow:, so no... you can't.

minstrel
Jan 29th 2006, 5:57 pm
Unless it's Googlebot. Although I've never tried this myself, the Google robots.txt information page (http://www.google.com/webmasters/bot.html#robotsinfo) seems to suggest that Googlebot does recognize Allow:...

Why isn't Googlebot obeying my robots.txt file?

To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.

We always suggest verifying that your syntax is correct against the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.

For more information, please see the Robots FAQ (http://www.robotstxt.org/wc/faq.html). If there still seems to be a problem, please let us know (http://www.google.com/support/bin/request.py?form_type=webmaster&stage=fm&user_type=webmaster&contact_type=other_webmaster).
...unless I'm reading this wrong.
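
For what it's worth, the difference between "first applicable rule" and "longest applicable rule" described in the quote can be sketched in a few lines of Python. This is a simplified illustration, not Googlebot's actual code: the function names and the test path /cgi-bin/search.pl are made up for the example, rules are treated as plain path prefixes, and wildcards and ties between equally long rules are ignored.

```python
# Each rule is (allow, path_prefix), in the order it appears in robots.txt.
# These mirror the "Allow: /" and "Disallow: /cgi-bin" lines quoted above.
RULES = [
    (True, "/"),
    (False, "/cgi-bin"),
]

def first_match_allows(rules, path):
    """Obey the first rule whose prefix matches the path."""
    for allow, prefix in rules:
        if path.startswith(prefix):
            return allow
    return True  # no rule applies, so crawling is allowed

def longest_match_allows(rules, path):
    """Obey the matching rule with the longest (most specific) prefix."""
    best = None
    for allow, prefix in rules:
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[1]):
                best = (allow, prefix)
    return True if best is None else best[0]

# Hypothetical path inside the disallowed directory.
path = "/cgi-bin/search.pl"
first_result = first_match_allows(RULES, path)
longest_result = longest_match_allows(RULES, path)

print("first-match allows:", first_result)
print("longest-match allows:", longest_result)
```

Under the first-match reading, "Allow: /" wins and the /cgi-bin path is crawled; under the longest-match reading, "Disallow: /cgi-bin" is more specific and the path is blocked, which matches the intent described in the quote.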