I was checking my website logs and noticed there are many new crawlers and bots fetching a lot of pages. I don't want to get listed in those small engines which won't bring any traffic but will only load the server from time to time. I was wondering, would they follow the instructions in robots.txt? There are also some scrapers that seem to fetch a lot of information by crawling. Any way to stop them? And could someone give me the complete robots.txt syntax to allow only Google, Yahoo and MSN? Thanks in advance. dhaliwal
Check out http://www.robotstxt.org/wc/exclusion-admin.html for more assistance and information. You can do specific "allows" and disallow all others.

What to put into the robots.txt file

The "/robots.txt" file usually contains a record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded. Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records. Note also that regular expressions are not supported in either the User-agent or Disallow lines; specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif". The '*' in the User-agent field is a special value meaning "any robot".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server:

User-agent: *
Disallow: /

To allow all robots complete access:

User-agent: *
Disallow:

(Or create an empty "/robots.txt" file.)

To exclude all robots from part of the server:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot:

User-agent: BadBot
Disallow: /

To allow a single robot:

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one: this is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/docs/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
This would specifically allow the Google, Yahoo and MSN robots access to the entire site, while disallowing access to all other bots:

User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

User-agent: *
Disallow: /
Thanks to both of you, but I wanted to ask one more thing. robots.txt will be obeyed only by nice robots from respectable engines; a scraper won't obey it, so would it be a good idea to ban its IP? And does anyone know an easy way to ban an IP block from IIS on a Windows 2k3 server? They have slowed my website a lot in the past few days.
The only thing about banning IPs is that you may ban undeserving folk too, if they are in the same network or if the IPs aren't static. As far as blocking in IIS goes, no idea, sorry.
I've had quite a few "scrapers" on my site and wrote a small piece of PHP code to exclude them. This works for crawlers that report themselves as Java with different versions. With PHP I simply put this before the html and head tags:

<?php
$agent = $_SERVER['HTTP_USER_AGENT'];
if (eregi("java/", $agent)) {
    exit();
}
?>

This seems to prevent the crawler from accessing anything other than the index page, as they seem to follow links and always hit the home page first. This might not work for all of them, but most of the scrapers that hit my site report as Java. As a downside, if there are any browsers that have "java/" in their user agent they will be excluded as well.
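If you want to catch more than just the Java-based scrapers, the same idea can be extended to a small list of user-agent substrings. This is only a rough sketch along the lines of the snippet above; the agent strings in the list are made-up examples, so you'd swap in whatever you actually see in your own logs:

<?php
// Substrings to look for in the User-Agent header.
// These values are examples only -- replace them with
// the agents you actually find in your server logs.
$blocked_agents = array('java/', 'libwww-perl', 'curl', 'wget');

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($blocked_agents as $bad) {
    // eregi() does a case-insensitive match, same as the snippet above
    if (eregi($bad, $agent)) {
        header('HTTP/1.0 403 Forbidden'); // send an explicit status instead of a blank page
        exit();
    }
}
?>

The 403 header is optional; a bare exit() works just as well, it simply returns an empty page instead of an explicit "forbidden".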
Exam, this is wrong. The User-agent sections should be swapped: * first, then the specific agents. Otherwise you simply cancel the specific Disallow with the * that follows. Here is what it should look like:

User-agent: *
Disallow: /

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow:
Actually what you've posted is incorrect. According to the robots.txt standard, the robot needs to follow the first applicable rule. Let's just walk through what you have. The Googlebot comes knocking, and the first line "User-agent: *" says match any user agent, so the Googlebot says, "OK, that matches me". Then comes the disallow, which says disallow everything. At that point the Googlebot says "OK, I'm outta here" and does not even finish reading the robots.txt file.

Specific rules always precede general rules. That way you specifically allow or disallow what you want to, and then if a robot gets past that (none of the specific rules apply to it), the general rule gets applied. So the correct way to do this is to specifically tell the three robots Google, MSN and Yahoo that they are allowed on the site, then tell everybody else to go away:

User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

User-agent: *
Disallow: /

EDIT: 1-script.com, re-reading your post, it appears that you may be confused about the disallowing and allowing. The "Disallow:" with nothing after it *allows* access to the whole site, while "Disallow: /" denies access to the whole site.
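To make that first-match behaviour concrete, here is a small hypothetical file combining both kinds of records (the /private/ path is just an example directory). A well-behaved crawler reads top to bottom and stops at the first User-agent record that matches it, so Googlebot is limited only by its own record and never reaches the catch-all one:

# Googlebot matches this record first: everything except /private/ is allowed
User-agent: Googlebot
Disallow: /private/

# Every other robot falls through to this record and is shut out completely
User-agent: *
Disallow: /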