How much traffic would I lose if I kept this robots.txt file on my site? It's a shame it won't show up in the stats right away, so I am trying to figure out how many of the smaller crawlers would drop the site from their index. Before this, my file only disallowed specific bots, but that list was growing too long.

User-agent: Mediapartners-Google*
Disallow:

User-Agent: ArchitextSpider # Excite
User-Agent: Ask Jeeves
User-Agent: FAST-WebCrawler
User-Agent: Freecrawl # euroseek.net
User-Agent: Googlebot
User-Agent: Googlebot-Mobile
User-Agent: Googlebot-Image
User-Agent: Adsbot-Google
User-Agent: Gulliver # Northern Light
User-Agent: ia_archiver
User-Agent: InfoSeek
User-Agent: Lycos
User-Agent: msnbot
User-Agent: Scooter
User-Agent: Slurp
Disallow:

User-Agent: *
Disallow: /
Thanks trichnosis, I changed it back to what it was before. BUT I thought all the crawlers listed at the top were allowed, and only the final "Disallow: /" rule blocked everything else.
But the line before that says ALL with the * wildcard. So you are basically saying "allow all the ones I have listed", but then disallowing everything. I assume you are trying to keep all other robots from visiting. Remember that only "good" robots will listen to a robots.txt, and those are the ones you have listed. If a certain bot is causing you issues, just ban it in your .htaccess.
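Something along these lines usually does the trick (just a rough sketch, assuming Apache with mod_rewrite enabled; "BadBot" is a placeholder for whatever user agent is giving you trouble):

# Block any request whose User-Agent contains "BadBot" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]

The [F] flag sends back a 403 Forbidden, so the bot gets an error page instead of your content, whether or not it bothers to read robots.txt.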
This is not correct. "User-Agent: *" means all other robots, i.e. those not mentioned in any other rule, so the bots listed by name still follow their own (empty) Disallow rule. This is explained in the original robots.txt specification.

Jean-Luc
Thanks guys. I am following this thread, and for now I have gone back to my previous version of the robots.txt file, which only lists the robots I want to disallow. I am picking up more knowledge about this! The file in this thread was in place for only two weeks, and I noticed a 5% decrease in organic traffic; it is hard to say why, as it could well just be the arrival of the summer months.
It doesn't work that way. Try it for yourself: set up a robots.txt file like the example given, then use one of those spider simulators or page crawlers that let you set the user agent to whatever you wish, and take note of what happens. You are also forgetting that only "good" bots like the ones listed will comply with a robots.txt command. The ones that cause trouble have to be blocked via .htaccess.
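If you would rather test it locally than trust an online tool, Python's built-in robotparser can run the same kind of check (just a quick sketch; the rules below are a trimmed-down version of the file from the first post, and example.com is a placeholder):

import urllib.robotparser

# Cut-down version of the robots.txt in question:
# the named crawlers get an empty Disallow, everyone else gets "Disallow: /".
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pretend to be different crawlers and note what can_fetch() reports for each.
for agent in ("Googlebot", "Slurp", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, "http://www.example.com/some-page.html"))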
I rely on the spec, not on the (possibly flawed) design of a spider simulator. I am not forgetting that. I fully agree with you on the need to use .htaccess for ill-intentioned bots.

Jean-Luc
There are a lot of them out there; are they all invalid? When I disallow all, the tools cannot crawl. Considering they aren't Googlebot, MSN, or Slurp, it shouldn't allow them... but it does. The spec you are relying on is not in control of any of the bots out there. It is a guide to how "well-behaved" bots work.