Googlebot is eating my bandwidth more than usual... I sent them a copy-and-paste of my Webalizer stats, but they replied saying they need detailed pages from my Apache log! How can I do this without wasting time manually going through a 20 MB log? Here's my January Webalizer log for GBot: Any ideas?
You should first check that you haven't got a dodgy page/script (a forum with session IDs?) somewhere where Googlebot is getting trapped in a loop. I'm not sure how you would check, but I'm sure someone here can help.
If you can figure out where it's using all that bandwidth, robots.txt would be a good solution on your side (without having to depend on G).
Find out which pages Googlebot was visiting and then decide whether you want to allow it to crawl all of them.
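For example, if the log shows the bot hammering one particular area (the /forum/search/ path below is just a placeholder, swap in whatever the log actually points at), a robots.txt in your web root along these lines would keep Googlebot out of it while leaving the rest of the site crawlable:

User-agent: Googlebot
Disallow: /forum/search/

Just be aware that robots.txt is per-site, so with several sites on the box you'd need one in each document root.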
grep Googlebot access.log >> fileforgoogle.txt will send all lines from access.log containing a user agent of Googlebot to a separate file. This requires that you use the combined log format for Apache, where the user agent is included in the main log file.
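If you then want to see where the bandwidth is actually going, something roughly like this should work on that filtered file, assuming the standard combined log format (request path in field 7, bytes sent in field 10):

awk '{bytes[$7] += $10} END {for (url in bytes) print bytes[url], url}' fileforgoogle.txt | sort -rn | head -20

That prints the 20 URLs Googlebot pulled the most bytes from, which should make the robots.txt decision (and any reply to Google) a lot easier.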
Btw, I don't think that hit count sounds so bad if you have a fair few pages; I usually get 1500+ hits/day from Googlebot.
D'oh, I didn't think of that... lol... Thanks. This will tell me if it's worth contacting Google again. I have a 100 GB/month limit (server-wide), but my server currently hosts 4 sites, so I have to watch the bandwidth. Thanks for the help.
I have more dynamic pages than static pages, like 10,000+ (18,600 indexed by G)... Anyway, I'll use the grep suggestion and then decide...
True, I do. However, most pages (even the dynamic ones) don't actually change content; it's my home page that changes on every page load (RSS feed stuff), so I'm not sure why GB keeps visiting those "static" pages. Unless, of course, the COOP Network links count as page changes?
If a page changes, even slightly, between crawls, then G will increase the crawl frequency because it thinks you're getting fresh content. And the more frequently G spiders your site, the more important it must think it is.