Do you have similar experience with the Yahoo bot (Slurp)? I get 500 hits a day from the Yahoo bot and expected all my pages (around 23,000) to be indexed, yet only 1,800 pages have been indexed. 1.2 GB of bandwidth a day is outrageous. So what is this all about? I'd appreciate it if you'd share your recent experiences with Slurp. Shall I change the Slurp crawl settings in robots.txt? What would you advise? Thanks a lot.
Yes, 1.2 GB of bandwidth a day is outrageous. I recently ran out of bandwidth on a site, and I believe Slurp was part of the problem. Slurp has been using up as much as the other search engines combined. Have you given Yahoo a sitemap? You should be able to solve your problem with a sitemap.
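If I remember right, Yahoo's site submission tool takes a standard sitemaps.org XML file, and something as bare-bones as this is enough (the URL and date here are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page.html</loc>
    <lastmod>2006-10-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>

One <url> entry per page, and keep the lastmod dates honest.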
Same here. One of our smaller sites all of a sudden had 20K 'uniques' according to Urchin. Turns out to be 90% Slurp. The website only has 100 or so pages and maybe 10 MB in total, yet they managed to eat around 400 MB worth of bandwidth. Slurpy must have gotten confused.
Are any of you submitting a sitemap/feed to Yahoo's Site tool? I'm not for this site but I am for another site which is also hit heavily by Slurpsy.
I've also seen ways to slow Slurp down (w/o denying it) using robots.txt. This can be a good way to limit how much it munches... If I can find the article I'll post it later (or someone else may have it).
Heh, I wrote in my blog, back in April, about my server getting the Slurp! Hump! and it's still happening.
http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html

Putting

User-agent: Slurp
Crawl-delay: XX

(where XX is a number for the delay) in robots.txt will at least slow it down and might help?
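For example, a complete robots.txt that only throttles Slurp and leaves everything else alone could look something like this (the 10 is just a number I picked, and I think Yahoo reads it as seconds between fetches):

User-agent: Slurp
Crawl-delay: 10
Disallow:

User-agent: *
Disallow: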
There's nothing much you can do, other than what vagrant has suggested. By the way, just 500 hits by Slurp eat up 1.2 GB of bandwidth? 500 hits? That's a bit strange.
Do you have lots of pages that link to each other? Slurp is probably following every single link without recognizing that it has already cached/read the page, resulting in some pages being read many times. This is what I've seen in the logs on my blog.
A good way to limit search bot traffic (as well as user bandwidth) is to make sure your server headers and timestamps are correct. Specifically: "Last-Modified" should only be updated when the content actually changes, and "Expires" should be set to the time and date when you expect the content to change again (say, 1 day or 1 week after Last-Modified). Static HTML files are handled properly by the webserver, so don't worry about them. Many dynamic systems, however, output a Last-Modified time which is always the current time, regardless of when the content actually changed.

Even better is to check the client headers for "If-Modified-Since": if the content has not been modified since that date, return HTTP 304 "Not Modified" and exit. The client (search engine or browser) will then use the copy it cached on a previous visit.

If you have an XML, RSS or ROR sitemap, make sure it has accurate timestamps. You don't want the search engines thinking your content has changed when it hasn't.

Finally, session IDs are a real killer for search engines, so make sure you detect their user agents and don't give them one.

Hope that helps, Cryo.
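P.S. Here's a rough sketch of that If-Modified-Since check in Python, just to show the idea. build_response() and the one-day Expires window are things I made up for the example; in practice you'd feed content_mtime from wherever your CMS records the real last-edit time.

from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

def build_response(if_modified_since_header, content_mtime):
    """Return (status, headers) for a page whose content last changed at content_mtime."""
    headers = {
        # Advertise only the time the content really changed, never "now".
        "Last-Modified": format_datetime(content_mtime, usegmt=True),
        # Tell clients and bots when it is worth checking again (one day later here).
        "Expires": format_datetime(content_mtime + timedelta(days=1), usegmt=True),
    }
    if if_modified_since_header:
        try:
            since = parsedate_to_datetime(if_modified_since_header)
        except (TypeError, ValueError):
            since = None
        if since is not None and content_mtime.replace(microsecond=0) <= since:
            # The client's cached copy is still good: send no body at all.
            return "304 Not Modified", headers
    return "200 OK", headers

# Example: content last edited yesterday; the bot re-requests with the timestamp from its last fetch.
last_edit = datetime.now(timezone.utc) - timedelta(days=1)
cached_stamp = format_datetime(datetime.now(timezone.utc) - timedelta(hours=1), usegmt=True)
print(build_response(cached_stamp, last_edit))  # -> ('304 Not Modified', {...})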
Yes, they are linked to each other, around 24,000 dynamic pages. Btw it is more than 500 hits; my stats show 500 uniques from Yahoo and 10 times more hits. How many bots does Yahoo have, and how can it be 500 uniques? Thanks cyrogenius, that explains a lot about my dynamic pages, but how can I prevent the system from changing the last-modified date on dynamic pages? Thank god I am not using session IDs for search engines. Btw thanks a lot, vagrant, I will try changing the robots.txt file and see what happens. I hope it will help a bit.
I created a couple of new sites that have a lot of dynamic pages on them. I've got to start paying attention to this as well or I could be getting myself into trouble. I just wish Google would crawl my site like this as well as Yahoo does.
Yep, one of my friends' sites is having the same problem. I wonder if Yahoo is going for a SERP makeover. IT
Thank goodness my host doesn't have bandwidth restrictions, otherwise I'd have been fried. Slurp was eating up 3.4 GB of average daily bandwidth on my 90k+ site last week.