I was checking my website logs and noticed there are many new crawlers and bots fetching a lot of pages. I don't want to get listed in those small engines which won't bring any traffic but will only load the server from time to time. I was wondering, would they follow the instructions in robots.txt? There are also some scrapers that seem to fetch a lot of information by crawling. Any way to stop them? And could someone give me the complete robots.txt syntax to allow only Google, Yahoo and MSN? Thanks in advance. dhaliwal
Check out http://www.robotstxt.org/wc/exclusion-admin.html for more assistance and information. You can do specific "allows" and disallow all others.

What to put into the robots.txt file

The "/robots.txt" file usually contains a record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded. Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records. Note also that regular expressions are not supported in either the User-agent or Disallow lines; specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif". The '*' in the User-agent field is a special value meaning "any robot".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server:

User-agent: *
Disallow: /

To allow all robots complete access:

User-agent: *
Disallow:

(Or create an empty "/robots.txt" file.)

To exclude all robots from part of the server:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot:

User-agent: BadBot
Disallow: /

To allow a single robot:

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one: this is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/docs/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
This would specifically allow the Google, Yahoo and MSN robots access to the entire site, while disallowing access to all other bots:

User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

User-agent: *
Disallow: /
Thanks to both of you, but I wanted to ask one more thing. robots.txt will be obeyed only by nice robots from respectable engines; a scraper won't obey it, so would it be a good idea to ban its IP? And does anyone know an easy way to ban an IP block from IIS on a Windows 2k3 server? They have slowed my website a lot in the past few days.
The only thing about banning IPs is that you may ban undeserving folk too, if they are in the same network or if the IPs aren't static. As far as blocking in IIS goes, no idea, sorry.
I've had quite a few "scrapers" on my site and wrote a small piece of PHP code to exclude them. This works for crawlers that report themselves as Java with different versions. With PHP I simply put this before the html and head tags:

<?php
$agent = $_SERVER['HTTP_USER_AGENT'];
if (eregi("java/", $agent)) {
    exit();
}
?>

This seems to prevent the crawler from accessing anything other than the index page, as they seem to follow links and always hit the home page first. This might not work for all of them, but most of the scrapers that hit my site report as Java. As a downside, if there are any browsers that have "java/" in their user agent they will be excluded as well.
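If you want to catch more than just the Java-based scrapers, the same idea can be extended to a small list of user-agent substrings. This is only a rough sketch along the lines of the snippet above; the agent strings in the list are made-up examples, so you'd swap in whatever you actually see in your own logs:

<?php
// Substrings to look for in the User-Agent header.
// These values are examples only -- replace them with
// the agents you actually find in your server logs.
$blocked_agents = array('java/', 'libwww-perl', 'curl', 'wget');

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($blocked_agents as $bad) {
    // eregi() does a case-insensitive match, same as the snippet above
    if (eregi($bad, $agent)) {
        header('HTTP/1.0 403 Forbidden'); // send an explicit status instead of a blank page
        exit();
    }
}
?>

The 403 header is optional; a bare exit() works just as well, it simply returns an empty page instead of an explicit "forbidden".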
Exam, this is wrong. The User-agent sections should be swapped: * first, then the specific agents. Otherwise you simply cancel the specific Disallow with the * that follows. Here is what it should look like:

User-agent: *
Disallow: /

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow:
Actually what you've posted is incorrect. According to the robots.txt standard, the robot needs to follow the first applicable rule. Let's just walk through what you have. The Googlebot comes knocking, and the first line "User-agent: *" says match any user agent, so the Googlebot says, "OK, that matches me". Then comes the disallow, which says disallow everything. At that point the Googlebot says "OK, I'm outta here" and does not even finish reading the robots.txt file.

Specific rules always precede general rules. That way you specifically allow or disallow what you want to, and then if a robot gets past that (none of the specific rules apply to it), the general rule gets applied. So the correct way to do this is to specifically tell the three robots Google, MSN and Yahoo that they are allowed on the site, then tell everybody else to go away:

User-agent: Googlebot
User-agent: Slurp
User-agent: Msnbot
Disallow:

User-agent: *
Disallow: /

EDIT: 1-script.com, re-reading your post, it appears that you may be confused about the disallowing and allowing. The "Disallow:" with nothing after it *allows* access to the whole site, while "Disallow: /" denies access to the whole site.
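To make that first-match behaviour concrete, here is a small hypothetical file combining both kinds of records (the /private/ path is just an example directory). A well-behaved crawler reads top to bottom and stops at the first User-agent record that matches it, so Googlebot is limited only by its own record and never reaches the catch-all one:

# Googlebot matches this record first: everything except /private/ is allowed
User-agent: Googlebot
Disallow: /private/

# Every other robot falls through to this record and is shut out completely
User-agent: *
Disallow: /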