Hello, today I noticed my site was almost down because 7 bots (including Yahoo and Google) were crawling it at the same time. How can I instruct them to make fewer simultaneous requests, to reduce the crawl frequency, or to pause between two requests?
You need a file called robots.txt in your root directory, using the Crawl-Delay directive. Basically it lets you specify an amount of time (in seconds) that bots should wait before retrieving another page from that host; Crawl-Delay: 10, for example, works out to at most about six requests per minute per crawler. NOTE: the Yahoo bot usually crawls larger sites from several IPs simultaneously. Example:

User-agent: *
Disallow:
Crawl-Delay: 10

User-agent: ia_archiver
Disallow: /

User-agent: Ask Jeeves
Crawl-Delay: 120

User-agent: Teoma
Disallow: /html/
Crawl-Delay: 120
Crawl-delay works for Yahoo, but it is ignored by Googlebot. For Google you can set the crawl rate in the Google Webmaster Tools panel instead; that may help.
Is there a tool to track bots on my site? Which one do you recommend? I need to find the bad ones that open too many requests at once, and then deny their IP addresses.
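One way to spot them without a dedicated tool is to tally your web server's access log by IP address and user agent. Below is a minimal sketch in Python; the log path /var/log/apache2/access.log and the Apache "combined" log format are assumptions, so adjust them to your setup:

#!/usr/bin/env python3
# Count requests per IP (showing the user agent) from an Apache
# "combined" format access log, so heavy crawlers stand out.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # assumption: adjust to your server

# combined log line: IP - - [date] "request" status size "referer" "user-agent"
line_re = re.compile(r'^(\S+) .*"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')

hits = Counter()   # requests per IP
agents = {}        # last seen user agent per IP

with open(LOG_PATH) as log:
    for line in log:
        match = line_re.match(line)
        if match:
            ip, agent = match.groups()
            hits[ip] += 1
            agents[ip] = agent

# Show the 20 busiest IPs together with their user agent strings.
for ip, count in hits.most_common(20):
    print("%6d  %-15s  %s" % (count, ip, agents.get(ip, "-")))

IPs near the top with thousands of hits and a non-browser user agent are the ones worth blocking.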
I'm denying access to the following user agents, because they are usually used by people to rip off a site's content: "Wget" "HTTrack" "WebCopier" "WebSauger" "WebReaper" "WebStripper" "Web Downloader" "libwww-perl" "Python-urllib"
You need to add this to your .htaccess file. Example, denying a few user agents and an IP range:

SetEnvIfNoCase User-Agent "WebCopier" dontlike
SetEnvIfNoCase User-Agent "WebSauger" dontlike
SetEnvIfNoCase User-Agent "WebReaper" dontlike
# RufusBot Address: 64.124.122.224 - 64.124.122.255
SetEnvIf Remote_Addr "^64\.124\.122\.2(2[4-9]|[3-5][0-9])" dontlike
Options -Indexes -Includes
Order allow,deny
Allow from all
Deny from env=dontlike
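Once the rules are in place, it is worth checking that they actually bite: a request with one of the blocked user agents should come back as 403 Forbidden, while a normal browser user agent should still get 200. Here is a small sketch using Python's standard urllib; the URL http://www.example.com/ is just a placeholder for your own site:

#!/usr/bin/env python3
# Quick check of the .htaccess deny rules: a blocked user agent should
# get 403 Forbidden, while a normal browser user agent still gets 200.
import urllib.request
import urllib.error

URL = "http://www.example.com/"  # placeholder: replace with your own site

def status_for(user_agent):
    req = urllib.request.Request(URL, headers={"User-Agent": user_agent})
    try:
        return urllib.request.urlopen(req).getcode()
    except urllib.error.HTTPError as err:  # denied requests raise HTTPError
        return err.code

print("WebCopier ->", status_for("WebCopier"))                 # expect 403
print("Browser   ->", status_for("Mozilla/5.0 (X11; Linux)"))  # expect 200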