Do you block certain spiders? If so, why? I have been getting hits like:
Hostname: spider-199-21-99-199.yandex.com
Hostname: ec2-52-4-176-40.compute-1.amazonaws.com
Hostname: static.227.10.9.176.clients.your-server.de
Montego Bay, Jamaica (multiple hits)
Hostname: 207.204.122.220 (France)
Hostname: 195-154-240-246.rev.poneytelecom.eu
(more foreign, then US, and I have no real need for foreign, as my target would be US....)
I block bots that are not recognized search engines because I neither need nor want them consuming server resources. Why should any webmaster pay to have bots crawl their site when they get nothing out of it? No revenue, no search engine traffic, nothing.
I can understand that. However, I know the common bots like Google, MSN, Yahoo... what I do not know is what the other bots do, such as majestic12.co.uk (Germany), which was the last bot to hit the site. I guess I need to find a list of bots to ban... otherwise, I could end up banning a bot that might have been useful, or maybe not... I just don't know... I also noticed I get some traffic from Flipboard. I had never heard of it until I saw a bot with flipboard in its name. Apparently it is similar to Pinterest: they spider your site for content snippets. I am not sure if this is good or bad, since this is how content can spread, and it does send a few visitors...
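Before hunting for a ban list, it can help to see which user agents actually hit the site most. Here is a minimal Python sketch that tallies user-agent strings, assuming the access log is in Apache's "combined" format (referer and user agent as the last two quoted fields). The function name and the sample lines are made up for illustration; they are not taken from anyone's real log.

```python
import re
from collections import Counter

# In the combined log format, each line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample lines only (documentation IP range, invented timestamps).
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"',
    '192.0.2.11 - - [01/Jan/2015:10:00:05 +0000] "GET /a HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"',
    '192.0.2.12 - - [01/Jan/2015:10:00:09 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    # Print the busiest user agents first.
    for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

For a real site you would pass the open log file instead of SAMPLE_LOG. Many bots put an info URL in the user agent, so the busiest unknown entries are a reasonable place to start researching.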
We are blocking spiders on some of our own websites, but not on customer websites. Inserting this code into .htaccess, or better, directly into the VirtualHost configuration of your website, will block the useless bots and spiders that only consume your traffic and slow down your website:

# mod_rewrite must be switched on for the rules below to take effect
RewriteEngine On
# RewriteBase is only valid in .htaccess or <Directory> context; leave it out inside a VirtualHost
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|spbot|DigExt|Sogou) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12|MJ12bot|MJ12Bot|Ezooms|CCBot|TalkTalk|Ahrefs) [NC]
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]
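If you want to sanity-check a pattern like this before putting it on a live server, here is a rough Python sketch that mirrors the user-agent alternatives from the rules above as one case-insensitive regex (the dot in MegaIndex.ru is escaped here). The helper name is_blocked and the sample user-agent strings are only for illustration, and Python's regex engine is not identical to Apache's, so treat it as an approximation rather than a guarantee of what mod_rewrite will do.

```python
import re

# Same alternatives as the RewriteCond lines, merged into one pattern.
# re.IGNORECASE plays the role of the [NC] flag.
BLOCKED = re.compile(
    r"(AhrefsBot|spbot|DigExt|Sogou"
    r"|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush"
    r"|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs)",
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the user agent contains any blocked substring."""
    return BLOCKED.search(user_agent) is not None


# Illustrative user-agent strings to try the pattern against.
SAMPLES = [
    "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)",
    "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        print("BLOCK" if is_blocked(ua) else "allow", ua)
```

Because the match is a plain substring search, any visitor whose user agent happens to contain one of these strings gets the 403 too, so it is worth running your own log's user agents through it first.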
Certainly there are more, right? Unless these are the biggest bandwidth killers.... I noticed several MJ spiders on earlier... Google can crawl my site all they want... other than Yahoo, MSN, etc., I do not think there are many more that are actually very useful, to be honest....
I saw 5+ on earlier, so I guess I should ban them... I do not see how Russian traffic would apply to a US market. That would be like me finding something through a search and buying it in Russia rather than buying it from a site in the US.
Many years ago I had a site that tracked spiders and I'd research them and publish my findings. It quickly got out of hand and the project was abandoned. What I discovered (back then):
- Large companies had their own internal search engines
- Universities had experimental search engines
- Google had bots crawling from a seemingly unlimited number of IPs
- There were lots of countries who had their own search engines in their own language
- Not all bots requested or complied with robots.txt
- Not all bots had an info url in their useragent
- None of the bots flooded my shared servers
- I could waste a lifetime trying to keep track of all the bots and I'd be no further ahead

If you've got so little bandwidth allocated to you that bots stealing your bandwidth becomes a problem then it's time to upgrade your hosting. On a cost benefit analysis the cost of the hosting will be far less than the value of your time wasted blocking bots. If you can't afford the hosting then you need to review your business plan and see if the business is worth continuing with in the first place.
How much bandwidth they are stealing was never my concern... but that does not mean I like wasting resources on worthless bots that provide zero value. I am not exactly sure what my bandwidth is at the moment; it never crossed my mind. I would imagine that if you simply neglect them, they take more and more, and as the site grows, so do the bots, unless you just like giving away free resources... Why would anyone want to appear in foreign search engines if that is not their target? Regardless of how good the server is, I do not need any more traffic on it than necessary at this time. From a test, I know my site can take more than 125 concurrent visitors at any given time. However, if 100 of those visitors are from Russia because of Yandex, how does that benefit me? I need US traffic on my site, not traffic that comes from useless foreign search engines....
And that's the rub. I've been down at my local supermarket as they've hauled 5 bottles of wine out of a woman's backpack as she was trying to leave. They let her go, didn't call the cops etc because the value of the time taken just to handle things right there and then exceeded the benefit - add in the time dealing with lawyers, turning up in court etc and prosecution becomes really expensive. Think of the junk bots as shoplifters. You know it goes on, you don't like it but the cost in time of stopping the problem exceeds any benefit.
When anyone is on a limited resource like a VPS or even a dedicated server and is nearing max capacity, one of the first places to look for savings is blocking those useless bots.
I am on shared hosting, and to be honest, I was not concerned about bandwidth until @sarahk brought it up... now I guess I have to go look, damn. I have other reasons to block evil bots than just bandwidth. Shared hosting can take more than you think; I think last year or the year before, I topped out at about 10 sites before I started having issues with the host. Then common sense told me that you just cannot effectively manage that many sites, and looking after one is far more effective. I have been on a VPS also, and even a VPS can take a good beating. If you are on a VPS and you are near shutdown, then something really must be wrong, because I know even a VPS can take quite a bit.... Besides, I am watching bots live, so when I am not doing anything, such as watching TV, it is no more work to ban them as they visit the site.... Someone just ran a whois query on my site. I am OK with that, no problem... You see, I am not a complete bastard, you know....
Popular or not, if it is made for US traffic, then there is no real need to block it unless I find it abusing my site.