Does anyone have a good list of email harvesters and other bad bots that should be blocked in the robots.txt file?
I don't think you can block them with a robots.txt file. Do you check your log files? If so, look up their IPs there and ban the harvesters. That will help at first, but most of the time they use multiple IPs, so you will have to ban them all.
There are crap loads to look through; I was just wondering if anyone had a standard list that they use for every site.
That list won't do anything unless the spider recognizes and adheres to the robots.txt rules, which I suspect most e-mail harvesters ignore. robots.txt is just a text file that the spider has to read; it is a voluntary method for excluding spiders. crazyhorse is right: you need to block them by their IP addresses.
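For anyone who wants to automate the log check, here is a rough Python sketch of the idea. It assumes the server writes the Apache "combined" log format, and the patterns in BAD_AGENT_PATTERNS are only illustrative guesses; swap in whatever agents actually show up in your own logs. The function name, sample lines, and IPs are made up for the example.

```python
import re
from collections import defaultdict

# Illustrative substrings only (assumption); replace with agents seen in your logs.
BAD_AGENT_PATTERNS = ("emailsiphon", "emailcollector", "harvest")

# Combined log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def suspicious_ips(lines):
    """Map each client IP to the set of suspicious user agents it sent."""
    hits = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent")
        if any(pattern in agent.lower() for pattern in BAD_AGENT_PATTERNS):
            hits[match.group("ip")].add(agent)
    return dict(hits)

# Made-up sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.5 - - [10/Oct/2005:13:55:36 -0700] "GET /contact.html HTTP/1.0" 200 2326 "-" "EmailSiphon"',
    '198.51.100.7 - - [10/Oct/2005:13:56:01 -0700] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]
print(suspicious_ips(sample))  # {'203.0.113.5': {'EmailSiphon'}}
```

The IPs it prints are only candidates for a ban. A harvester can fake its user agent or rotate addresses, so the output still needs a manual look.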
To save bandwidth. The above Microsoft listing isn't the normal MSN user agent. It uses a different name,
Just to ponder an idea: rather than blocking specific robots, how about blocking all robots and only allowing the few major bots through? thx, tom
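A minimal sketch of what that allow-list could look like, checked with Python's standard urllib.robotparser. The two bots let through (Googlebot and Slurp) are just an assumed pick of "major bots", and the allowed() helper is made up for the example. As noted earlier in the thread, this only keeps out robots that actually obey robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed allow-list: an empty Disallow means "nothing is off limits".
ALLOW_LIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALLOW_LIST_ROBOTS_TXT.splitlines())

def allowed(agent, url="/index.html"):
    """Return True if a rule-abiding robot with this name may fetch url."""
    return parser.can_fetch(agent, url)

print(allowed("Googlebot"))    # True
print(allowed("Slurp"))        # True
print(allowed("EmailSiphon"))  # False, but only if the bot honors the file
```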
I try to keep the list at "What are some bad web robots?" updated. Y'know you only have to say Disallow once?
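If the point is that several User-agent lines can share a single Disallow line, a quick check with Python's standard urllib.robotparser bears that out. The two agent names below are only examples, not a recommended list.

```python
from urllib.robotparser import RobotFileParser

# One record: several User-agent lines, then a single Disallow for all of them.
GROUPED_ROBOTS_TXT = """\
User-agent: EmailSiphon
User-agent: EmailCollector
Disallow: /
"""

grouped = RobotFileParser()
grouped.parse(GROUPED_ROBOTS_TXT.splitlines())

print(grouped.can_fetch("EmailSiphon", "/"))     # False
print(grouped.can_fetch("EmailCollector", "/"))  # False
print(grouped.can_fetch("Googlebot", "/"))       # True, no rule applies to it
```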