How to Stop Web Robot from Engines other than G and Y and MSN
gdtechind
Oct 27th 2005, 11:05 am
I was checking my website logs and saw many new crawlers and bots fetching a lot of information. I don't want to get listed in those small engines, which won't bring any traffic and will only load the server from time to time.
I was wondering whether they would follow the instructions in robots.txt?
There are also some scrapers that seem to fetch a lot of information by crawling. Is there any way to stop them?
Could someone also give the complete robots.txt syntax to allow just Google, Yahoo, and MSN?
Thanks in advance.
dhaliwal
wrmineo
Oct 27th 2005, 11:16 am
Check out http://www.robotstxt.org/wc/exclusion-admin.html for more assistance and information
You can do specific "allows" and disallow all others.
What to put into the robots.txt file
The "/robots.txt" file usually contains a record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
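To see the prefix matching described above in action, here is a minimal sketch that runs the example record through Python's standard urllib.robotparser module. The agent name ExampleBot and the test paths are made up for illustration, and the result shows how this one parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The example record from above: three URL prefixes excluded for all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if `agent` (a hypothetical robot name) may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/search.pl", "/tmp/scratch.txt", "/tmpfile.html"):
        print(path, allowed(path))
```

With this parser, /tmp/scratch.txt is refused because it starts with the prefix /tmp/, while /tmpfile.html is allowed because it does not.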
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/docs/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
exam
Oct 27th 2005, 11:52 am
This would specifically allow the Google, Yahoo, and MSN robots access to the entire site, while disallowing access to all other bots.
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:
User-agent: *
Disallow: /
gdtechind
Oct 27th 2005, 12:46 pm
Thanks to both of you.
But I wanted to ask one more thing.
robots.txt will be obeyed only by nice robots from respectable engines. A scraper won't obey it, so would it be good to ban its IP?
Also, does anyone know an easy way to ban an IP block in IIS on a Windows 2003 server? The scrapers have slowed my website a lot in the past few days.
exam
Oct 27th 2005, 1:33 pm
The only thing about banning IPs is that you may ban undeserving folk too, if they are in the same network or if IPs aren't static. As far as blocking in IIS, no idea, sorry :(
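To illustrate why banning a whole block can catch undeserving visitors, here is a small sketch using Python's standard ipaddress module. The banned network is a made-up example (203.0.113.0/24 is a range reserved for documentation), and the code only checks membership; it does not configure IIS or any other server.

```python
import ipaddress

# Hypothetical block to ban; 203.0.113.0/24 is reserved for documentation.
BANNED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_banned(address):
    """Return True if `address` falls inside any banned network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BANNED_NETWORKS)


if __name__ == "__main__":
    print(is_banned("203.0.113.57"))   # the scraper's address, in this example
    print(is_banned("203.0.113.200"))  # an unrelated neighbour in the same block
    print(is_banned("198.51.100.7"))   # an address outside the block
```

Both addresses inside the /24 are reported as banned, even though only one of them is the scraper in this example.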
nightmare5liter
Oct 27th 2005, 1:35 pm
I've had quite a few "scrapers" on my site and wrote a small PHP snippet to exclude them. It works for crawlers that report themselves as Java, whatever the version. With PHP I simply put this before the html and head tags:
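<?php
// Stop processing if the User-Agent contains "java/" (case-insensitive).
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($agent, 'java/') !== false) {
    exit();
} ?>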
This seems to prevent the crawler from accessing anything other than the index page, as they seem to follow links and always hit the home page first.
This might not work for all of them, but most of the scrapers that hit my site report as Java.
As a downside, if there are any browsers that have java/ in their user agent, they will be excluded as well.
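For comparison, the same user-agent check can be sketched in Python as a WSGI wrapper. The names is_blocked_agent and block_agents are made up for this example, the token list only contains the "java/" string mentioned above, and returning a 403 is just one possible response.

```python
def is_blocked_agent(user_agent, blocked_tokens=("java/",)):
    """Return True if the User-Agent contains a blocked token (case-insensitive)."""
    agent = (user_agent or "").lower()
    return any(token in agent for token in blocked_tokens)


def block_agents(app):
    """Wrap a WSGI app so that blocked agents get a 403 instead of the page."""
    def wrapped(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapped
```

As with the PHP version, any legitimate client whose user agent happens to contain java/ would be refused too.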
exam
Oct 27th 2005, 1:45 pm
That works, but it's probably more efficient to do it at the server level.
1-script.com
Nov 30th 2005, 10:38 am
This would specifically allow The Google Yahoo and MSN robots access to the entire site, while disallowing access to all other bots.
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:
User-agent: *
Disallow: /
Exam, this is wrong. The User-agent sections should be swapped: * first, then specific agent. Otherwise you simply cancel the specific Disallow with the * that follows. Here is what it should look like:
User-agent: *
Disallow: /
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow:
exam
Dec 1st 2005, 8:35 am
Exam, this is wrong. The User-agent sections should be swapped: * first, then specific agent. Otherwise you simply cancel the specific Disallow with the * that follows. Here is what it should look like:
User-agent: *
Disallow: /
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow:
Actually what you've posted is incorrect :)
According to the robots.txt standard, the robot needs to follow the first applicable rule. Let's just walk through what you have. The Googlebot comes knocking, and the first line, "User-agent: *", says match any user agent, so the Googlebot says, "OK, that matches me." Then comes the Disallow, which says disallow everything. At that point the Googlebot says, "OK, I'm outta here" :) and does not even finish reading the robots.txt file.
Specific rules always precede general rules. That way you specifically allow or disallow what you want to and then if a robot gets past that, (none of the specific rules apply to it) then the general rule gets applied. So the correct way to do this is specifically tell the 3 robots Google, MSN and Yahoo that they are allowed on the site, then tell everybody else to go away:
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:
User-agent: *
Disallow: /
EDIT: 1-script.com, re-reading your post, it appears that you may be confused about the disallowing and allowing. The Disallow: with nothing after it, *allows* access to the whole site, while Disallow: / denies access to the whole site.
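One way to check how a parser reads either ordering is to try it. The sketch below feeds both versions to Python's standard urllib.robotparser module; SomeOtherBot is a made-up agent name and /index.html an arbitrary path. Keep in mind this only shows what this one parser does. It treats the "*" record as a fallback wherever it appears, so it accepts both orderings; a crawler that stops at the first matching record would not, which is why listing the specific records first is the safer layout.

```python
from urllib.robotparser import RobotFileParser

# Specific records first, general record last.
SPECIFIC_FIRST = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

User-agent: *
Disallow: /
"""

# General record first, specific records last.
WILDCARD_FIRST = """\
User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:
"""


def verdicts(robots_txt, agents=("Googlebot", "Slurp", "Msnbot", "SomeOtherBot")):
    """Map each agent name to whether this parser lets it fetch /index.html."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/index.html") for agent in agents}


if __name__ == "__main__":
    print(verdicts(SPECIFIC_FIRST))
    print(verdicts(WILDCARD_FIRST))
```

In both cases the three named robots are allowed and SomeOtherBot is refused, which also shows the empty "Disallow:" acting as "allow everything".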