
Limit bot access frequency

Discussion in 'robots.txt' started by softarea51, Nov 1, 2007.

  1. softarea51 (Active Member)
    #1
    Hello,

    Today I noticed my site was almost down because 7 bots (including Yahoo and Google) were crawling it at the same time. How can I instruct them to make fewer requests at once, to reduce the frequency, or to pause between two requests?
     
    softarea51, Nov 1, 2007 IP
  2. ajsa52 (Well-Known Member)
    #2
    You need a file called robots.txt in your root directory, with the Crawl-delay directive.
    Basically, it lets you specify how long (in seconds) a bot should wait before retrieving another page from your host.
    NOTE: the Yahoo bot usually crawls larger sites from several IPs simultaneously.

    Example:

    
    # All bots: allow everything, but wait 10 seconds between requests
    User-agent: *
    Disallow:
    Crawl-delay: 10

    # Keep the Internet Archive crawler out entirely
    User-agent: ia_archiver
    Disallow: /

    # Slow Ask Jeeves down to one request every two minutes
    User-agent: Ask Jeeves
    Crawl-delay: 120

    # Same for Teoma, and keep it out of /html/
    User-agent: Teoma
    Disallow: /html/
    Crawl-delay: 120
    
    Code (markup):
     
    ajsa52, Nov 1, 2007 IP
  3. Monty (Peon)
    #3
    Crawl-delay is fine for Yahoo, but it's ignored by Googlebot.

    For Google you can set the crawl rate from the Google Webmaster Tools panel; that may help.
     
    Monty, Nov 1, 2007 IP
  4. softarea51 (Active Member)
    #4
    Thank you all.
     
    softarea51, Nov 1, 2007 IP
  5. softarea51 (Active Member)
    #5
    Is there a tool to track bots on my site? Which one do you recommend?
    I need to find the bad ones, the ones that open too many requests at once, and deny their IP addresses.
     
    softarea51, Nov 5, 2007 IP
  6. ajsa52 (Well-Known Member)
    #6
    I'm denying access to the following user agents, because they're usually used to rip off a site's content (to spot heavy hitters like these in your logs, see the sketch after the list):
    "Wget"
    "HTTrack"
    "WebCopier"
    "WebSauger"
    "WebReaper"
    "WebStripper"
    "Web Downloader"
    "libwww-perl"
    "Python-urllib"
     
    ajsa52, Nov 5, 2007 IP
  7. softarea51 (Active Member)
    #7
    How do you block a user agent?
     
    softarea51, Nov 5, 2007 IP
  8. ajsa52 (Well-Known Member)
    #8
    You need to add some rules to your .htaccess file.
    For example, denying a few user agents and an IP range:

    
    SetEnvIfNoCase User-Agent "WebCopier"       dontlike
    SetEnvIfNoCase User-Agent "WebSauger"       dontlike
    SetEnvIfNoCase User-Agent "WebReaper"       dontlike
    
    # RufusBot Address: 64.124.122.224 - 64.124.122.255
    SetEnvIf Remote_Addr "^64\.124\.122\.2(2[4-9]|[3-5][0-9])"  dontlike
    
    Options -Indexes -Includes
    Order allow,deny
    Allow from all
    Deny from env=dontlike
    
    
    Code (markup):
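
    Once those rules are in place, a quick way to verify them is to request a page while sending one of the blocked user agents and check that the server answers 403. A minimal sketch; the URL is a placeholder for your own site:

    # Sketch only: confirm the .htaccess rules by sending a blocked
    # user agent and expecting HTTP 403. Replace the URL with your site.
    import urllib.request
    import urllib.error

    req = urllib.request.Request("http://www.example.com/",
                                 headers={"User-Agent": "WebCopier"})
    try:
        urllib.request.urlopen(req)
        print("NOT blocked: the rules are not taking effect")
    except urllib.error.HTTPError as e:
        if e.code == 403:
            print("blocked as expected (403)")
        else:
            print("unexpected status: %d" % e.code)
    Code (markup):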
     
    ajsa52, Nov 5, 2007 IP