We host a number of websites, all with fairly complex ecommerce-style functionality: many products, SEO landing pages, search facets, and different domains for different languages. We originally set up the servers to cope with the traffic we were expecting, but over the past two years load from robots has increased massively; in fact, the majority of our server resources now go to dealing with search engine crawlers. The search engines seem to be hitting the sites more often, they frequently ignore robots.txt, and new search engines and bots appear all the time.

Bing seems to be one of the worst culprits, often 10 requests per second to the server. Add to that Google at 6 requests per second, Baidu at 3 or 4, Yahoo occasionally, a South Korean engine, and so on. We will need to expand the infrastructure, as the sites are starting to see performance issues at peak times.

We've done all the things we should: reducing the crawl rate in robots.txt, optimising and caching content, adding canonical links, reducing the number of search facets exposed to crawlers, and blocking Baidu completely. But I'm worried this is just going to keep getting worse. We can't block robots entirely, as SEO is crucial to our clients.

What are other people's experiences with this? Do you have similar concerns? What can be done about it?
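For reference, the crawl-rate and blocking measures described above typically look something like the robots.txt sketch below. The facet path is a placeholder; Crawl-delay is honoured by Bingbot and Yahoo's Slurp but ignored by Googlebot (Google's rate has to be lowered via Search Console instead), and wildcard Disallow patterns are an extension supported by the major engines rather than part of the original standard.

    # Ask crawlers that honour Crawl-delay (Bing, Yahoo) to slow down;
    # Googlebot ignores this directive.
    User-agent: bingbot
    Crawl-delay: 10

    User-agent: Slurp
    Crawl-delay: 10

    # Keep crawlers out of faceted-search URLs (placeholder paths).
    # Note: a crawler that matches a named group above follows only that
    # group, so repeat these Disallow lines there if they should apply.
    User-agent: *
    Disallow: /search/
    Disallow: /*?facet=

    # Block Baidu entirely
    User-agent: Baiduspider
    Disallow: /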
Hi tiff, we're already load balancing... the possibility of scaling isn't really the issue; it's more the worrying trend of the load that bots place on all servers in general...
Perhaps you should consider serving a cached (slightly stale) copy of each page to visitors with search engine user-agents, handled via mod_rewrite so those requests never reach PHP at all.
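A minimal sketch of that idea in Apache config, assuming some separate process pre-renders pages as static HTML under a cache/ directory in the document root (the directory name and user-agent list are illustrative, and the cache-population side isn't shown):

    RewriteEngine On

    # Only intercept the big crawlers; normal visitors still hit PHP
    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|yeti) [NC]
    # Only GET requests, and only when a pre-rendered copy actually exists
    RewriteCond %{REQUEST_METHOD} ^GET$
    RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI}.html -f
    # Serve the static file directly; [L] stops further rewriting
    RewriteRule .* /cache%{REQUEST_URI}.html [L]

Crawlers don't need up-to-the-second pages, so even a cache refreshed every few hours would take most of that bot load off PHP and the database, while ordinary visitors keep getting live pages.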