
How to Stop Web Robots from Engines Other than Google, Yahoo and MSN

Discussion in 'robots.txt' started by gdtechind, Oct 27, 2005.

  1. #1
    I was checking my website logs and saw that there are many new crawlers and bots fetching a lot of information. I don't want to get listed in those small engines, which won't bring any traffic but will only load the server from time to time.

    I was wondering if they would follow the instructions in robots.txt?

    There are some scrapers as well that seem to fetch a lot of information by crawling. Any way to stop them?

    And if someone could give the complete syntax for robots.txt to just allow

    Google
    Yahoo
    and MSN

    thanks in advance.

    dhaliwal
     
    gdtechind, Oct 27, 2005 IP
  2. wrmineo

    #2
    Check out http://www.robotstxt.org/wc/exclusion-admin.html for more assistance and information.

    You can do specific "allows" and disallow all others.


    What to put into the robots.txt file
    The "/robots.txt" file usually contains a record looking like this:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/

    In this example, three directories are excluded.

    Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records.

    Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

    What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

    To exclude all robots from the entire server
    User-agent: *
    Disallow: /

    To allow all robots complete access
    User-agent: *
    Disallow:

    Or create an empty "/robots.txt" file.

    To exclude all robots from part of the server
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

    To exclude a single robot
    User-agent: BadBot
    Disallow: /

    To allow a single robot
    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /

    To exclude all files except one
    This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
    User-agent: *
    Disallow: /~joe/docs/

    Alternatively you can explicitly disallow all disallowed pages:
    User-agent: *
    Disallow: /~joe/private.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html
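
    (As an aside, and not part of the standard quoted above: some of the big crawlers, Googlebot included, also understand a non-standard "Allow" line, which makes the single-file case less awkward. Whether a given bot honors it varies, so treat this as a sketch only, with a hypothetical /~joe/index.html as the one allowed file:
    User-agent: *
    Allow: /~joe/index.html
    Disallow: /~joe/

    For bots that ignore "Allow", the separate-directory trick above is still the safe fallback.)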
     
    wrmineo, Oct 27, 2005 IP
  3. exam

    #3
    This would specifically allow the Google, Yahoo and MSN robots access to the entire site, while disallowing access to all other bots.
    User-agent: Googlebot
    User-agent: Slurp
    User-agent: Msnbot
    Disallow:
    User-agent: *
    Disallow: /
    Code (markup):
     
    exam, Oct 27, 2005 IP
    wrmineo likes this.
  4. gdtechind

    #4
    Thanks to both of you.

    But I wanted to ask one more thing.

    robots.txt will be obeyed only by nice robots from respectable engines. A scraper won't obey it, so would it be good to ban its IP?

    And does anyone know an easy way to ban an IP block in IIS on a Windows 2k3 server? They have slowed my website a lot in the past few days.
     
    gdtechind, Oct 27, 2005 IP
  5. exam

    #5
    The only thing about banning IPs is that you may ban undeserving folk too, if they are in the same network or if IPs aren't static. As far as blocking in IIS, no idea, sorry :(
     
    exam, Oct 27, 2005 IP
  6. nightmare5liter

    #6
    I've had quite a few "scrapers" on my site and wrote a small piece of PHP code to exclude them. This works for crawlers that report themselves as Java with different versions. With PHP I simply put this before the html and head tags:

    <?php
    // reject any request whose user agent contains "java/"
    $agent = $_SERVER['HTTP_USER_AGENT'];
    if (eregi("java/", $agent)) {
        exit();
    }
    ?>

    This seems to prevent the crawler from accessing anything other than the index page, since they seem to follow links and always hit the home page first.

    This might not work for all of them, but most of the scrapers that hit my site report as Java.

    As a downside, if there are any browsers that have java/ in their user agent, they will be excluded as well.
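
    The same page-level trick could be turned around for the IP-ban idea from post #4. A minimal PHP sketch (the addresses here are hypothetical placeholders, and this is only an application-level stopgap, not a substitute for a firewall or IIS rule) might look like:

    <?php
    // hypothetical block list; real addresses would come from the server logs
    $banned = array('192.0.2.10', '192.0.2.11');
    if (in_array($_SERVER['REMOTE_ADDR'], $banned)) {
        exit();
    }
    ?>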
     
    nightmare5liter, Oct 27, 2005 IP
  7. exam

    #7
    That works, but it's probably more efficient to do it at the server level.
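
    As a rough sketch of what that could look like at the server level (this assumes an Apache box with mod_rewrite and is only an illustration; an IIS 6 setup like the one in this thread would need something different, such as an ISAPI filter):

    # .htaccess: refuse any request whose User-Agent contains "java/"
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} java/ [NC]
    RewriteRule .* - [F]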
     
    exam, Oct 27, 2005 IP
  8. 1-script.com

    #8
    Exam, this is wrong. The User-agent sections should be swapped: * first, then specific agent. Otherwise you simply cancel the specific Disallow with the * that follows. Here is what it should look like:
    
    User-agent: *
    Disallow: /
    User-agent: googlebot
    User-agent: slurp
    User-agent: msnbot
    Disallow:
    
    Code (markup):
     
    1-script.com, Nov 30, 2005 IP
  9. exam

    #9
    Actually what you've posted is incorrect :)

    According to the robots.txt standard, the robot follows the first record that applies to it. Let's just walk through what you have. The Googlebot comes knocking, and the first line "User-agent: *" says match any user agent, so the Googlebot says, "OK, that matches me." Then comes the Disallow, which says disallow everything. At that point the Googlebot says, "OK, I'm outta here" :) and does not even finish reading the robots.txt file.

    Specific rules should always precede general rules. That way you specifically allow or disallow what you want, and if a robot gets past that (none of the specific rules apply to it), then the general rule gets applied. So the correct way to do this is to specifically tell the three robots, Google, MSN and Yahoo, that they are allowed on the site, then tell everybody else to go away:
    User-agent: Googlebot
    User-agent: Slurp
    User-agent: Msnbot
    Disallow:
    User-agent: *
    Disallow: /
    Code (markup):
    EDIT: 1-script.com, re-reading your post, it appears that you may be confused about disallowing and allowing. Disallow: with nothing after it *allows* access to the whole site, while Disallow: / denies access to the whole site.
     
    exam, Dec 1, 2005 IP
  10. Spearheadltd

    #10
    I use Spyder Spanker, which will stop anything you don't want to get in. Google it.
     
    Spearheadltd, Jun 1, 2012 IP