Urgent - Need to stop most spiders (not Google or Yahoo) from crawling my site - how?

Discussion in 'Site & Server Administration' started by yenerich, Apr 7, 2009.

  1. #1
    I just do not want to drive Google away or anything like that.
    But some unknown engines are eating all my bandwidth (in one day, a single "crawl" can take 4.5 GB!).

    Give me some advice, please.
     
    yenerich, Apr 7, 2009 IP
  2. #2
    jestep, Apr 7, 2009 IP
  3. #3
    How can I, using robots.txt, allow Google, MSN, and Yahoo, and disallow the rest of the spiders?
     
    yenerich, Apr 7, 2009 IP
  4. #4
    You can express that: robots.txt rules are grouped by User-agent, so you give the bots you want an empty Disallow and then disallow everything for *. The catch is that robots.txt is purely voluntary, so the abusive bots eating your bandwidth will most likely ignore it.
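    Something along these lines (a sketch; Slurp is Yahoo's crawler and msnbot is MSN's, and only polite bots will obey it):

        User-agent: Googlebot
        Disallow:

        User-agent: Slurp
        Disallow:

        User-agent: msnbot
        Disallow:

        User-agent: *
        Disallow: /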
     
    jestep, Apr 7, 2009 IP
  5. #5
    I will code a solution myself, similar to one I read about somewhere:

    I will put a small (1-pixel) image on the page that links to some URL, with the link marked nofollow.
    A normal user will not see it and Google won't follow it.
    Anything that requests that URL gets banned, and an email is sent to me so I can check whether to keep the ban or lift it.
    I will see if it works well.
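    A rough sketch of the idea in Python (Flask); the /trap URL, bad_ips.txt, and the e-mail addresses are just placeholders, not a finished implementation:

        from flask import Flask, abort, request
        import smtplib
        from email.message import EmailMessage

        app = Flask(__name__)
        BLOCKLIST = "bad_ips.txt"

        def blocked_ips():
            # one banned IP per line; empty set if the file does not exist yet
            try:
                with open(BLOCKLIST) as f:
                    return {line.strip() for line in f}
            except FileNotFoundError:
                return set()

        @app.before_request
        def reject_banned():
            # every request from a trapped IP gets a 403
            if request.remote_addr in blocked_ips():
                abort(403)

        @app.route("/trap")
        def trap():
            # reached only through the invisible link, e.g.:
            # <a href="/trap" rel="nofollow"><img src="px.gif" width="1" height="1" alt=""></a>
            ip = request.remote_addr
            with open(BLOCKLIST, "a") as f:
                f.write(ip + "\n")
            msg = EmailMessage()
            msg["Subject"] = "Honeypot tripped by " + ip
            msg["From"] = "trap@example.com"
            msg["To"] = "me@example.com"
            msg.set_content("Check the logs and lift the ban if it was a false positive.")
            with smtplib.SMTP("localhost") as s:  # assumes a local mail server
                s.send_message(msg)
            abort(403)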
     
    yenerich, Apr 7, 2009 IP
  6. #6
    Check out http://www.robotstxt.org/faq/prevent.html
     
    seoemporium, Apr 18, 2009 IP
  7. #7
    yenerich, Apr 18, 2009 IP
  8. #8
    As a general rule, robots.txt is a WISH list, not a way to CONTROL anything:
    the spiders/bots listed in robots.txt are in no way forced to follow the rules specified there.

    If you wish to CONTROL access on a per-bot (user-agent) basis,
    then you use .htaccess, as shown in the example below. This way you allow all agents (bots) by default and block access only for specific listed bots. This is true control, because it works for every bot whether or not it wants to obey the robots.txt rules!
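    Something like this (Apache 2.2-era syntax; the agent names are only examples, so use the ones you actually find in your own access_log):

        # mark the listed user-agents, allow everybody else
        SetEnvIfNoCase User-Agent "WebCopier" stayout
        SetEnvIfNoCase User-Agent "HTTrack"   stayout
        SetEnvIfNoCase User-Agent "BadBot"    stayout

        Order Allow,Deny
        Allow from all
        Deny from env=stayout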

    If you google for "bad bots", you will find lists of bots that are either used for unfriendly purposes or just burn your resources without any benefit to you.

    Blocking on an agent-by-agent basis gives you 2 benefits:

    all bots are welcome by default, hence NO accidental access-denial for an unknown but legitimate SE bot
    only those that YOU find abusing your resources are added, one by one, based on your own access_log records

    If you look at your daily/weekly access stats and see a new bot looping or abusing your resources, just add that bot with a new line to the "stayout" list in your .htaccess file (a quick way to spot such bots is sketched below).

    If a single SE bot generates GB of daily traffic, it is usually a NEW SE with a looping bot.
    If that new SE is of any value, you may take the time to inform it, with a few lines from your access_log file as evidence;

    or, if you consider that SE useless for your real / human traffic, you may simply add it to the deny list.
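    One quick way to spot those bots, sketched in Python (assumes the common combined log format and an access_log file in the current directory):

        # count requests per user-agent in a combined-format access_log
        from collections import Counter

        counts = Counter()
        with open("access_log") as f:
            for line in f:
                parts = line.split('"')
                if len(parts) >= 7:
                    counts[parts[5]] += 1  # 6th quote-split field is the user-agent
        for agent, hits in counts.most_common(10):
            print(hits, agent)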
     
    hans, Apr 18, 2009 IP