Just trolling for ideas to improve the management of logs for 95 sites hosted across various providers. All logs are in an Apache format and brought back to an in-house server. I'm using Webalizer for individual site stats, with a few custom scripts to resolve the logs and group them by site.

The big challenge I'm facing is storing all the logs and keeping them associated with the right hosts. One thought was to import them into a database, then extract for analysis. Right now I have them stored by hosting provider, then vhost domain and year.

Another challenge is recombining logs from multiple servers; some drop logs daily, weekly, or monthly depending on traffic volume. This means I'm basically rebuilding a central set of log files each time I process them. Currently I'm using a C program called 'mergelog' when I update analytics.

I'm looking for a solution that requires less horsepower or is more easily distributed. It would also be nice if it took less disk space and time to update stats. What are you doing?

Best, Justin

PS. I've looked at commercial solutions like WebTrends and NetIQ; the last couple of quotes were just way out of budget for the information I'm using. I'd rather invest the money in building traffic.
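PPS. To make the rebuild step concrete, here's roughly what it looks like today (paths and file names are illustrative, not my exact layout):

#!/bin/sh
# Rebuild one vhost's combined log from the provider/vhost/year
# tree before the stats run. Illustrative paths only; assumes
# mergelog is on the path and the pieces are uncompressed.
VHOST=$1   # e.g. example.com
VYEAR=$2   # e.g. 2005
# /logs/<provider>/<vhost>/<year>/ is the storage layout described above
mergelog /logs/*/$VHOST/$VYEAR/access_log-* > /tmp/$VHOST-$VYEAR-combined.log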
I haven't looked at NetIQ, but the last company I was with used WebTrends, and DAMN was it expensive.

We made each log directory something like /logs/sitename/web7/ depending on which server it was on. That way, all of them could be copied back to the log-processing server and none of them would overwrite each other. We used a perl script to merge the log files together before processing. All of our sites rotated the log files daily and purged them after two weeks regardless of how big they were. (Actually, some of them rotated hourly because they would go over the 2GB file size limit within a single day.)

I read something a couple of years back about how Slashdot inserts all of its log entries straight into a database from multiple web servers and processes them from there. I think that would probably be a good way for you to go. I'm sure you could also store a simple numerical code for frequently accessed URLs instead of the whole URL in the database. You could probably also store the database files compressed to save even more space.
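If it helps, the collection side of what we did boils down to something like this in shell form (server names and paths are made up, and our real merge step was the perl script; mergelog, which you already have, does the same date-ordered merge):

#!/bin/sh
# Pull each server's rotated logs into its own subdirectory so
# copies never overwrite each other, then merge by timestamp.
SITE=example.com
for SRV in web1 web2 web7; do
    mkdir -p /logs/$SITE/$SRV
    rsync -a $SRV:/var/log/httpd/$SITE/ /logs/$SITE/$SRV/
done
# one date-ordered stream for the stats processor
mergelog /logs/$SITE/*/access_log.* > /logs/$SITE/combined.log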
If you are running cPanel on the hosts, you might want to take a look at Webalizer Multi Server Website Stats. You can get it here: http://www.scriptillusion.com/ I haven't tried it, but it might do the trick for you. Also, SmarterStats from smartertools.com will work with multiple sites and is much less expensive than WebTrends. t
I have something similar after logs are downloaded from the servers (i.e. /export/webstats/d/domain.com/YYYY/). Since my files are in different data centers, I still have someone do this manually or via FTP scripts. Logs are serialized and tagged with the source server (i.e. SERVER-access_log-DOMAIN-YYYYMMDDhhmm.log.gz). Since I have several sites spanned across multiple data centers, I'll process these logs into sorted temporary files for analysis (i.e. access_log-tmpNN-DOMAIN.log). I'm guessing your perl works similarly to my Bash scripts.

For logs all in the same data center I've tested 'mod_log_mysql', but I had problems pulling everything back out of the database to do the analysis, since not all tools would talk to the database the same way. It was nice because a server (cluster member) could fail and you'd still have up-to-the-minute error logs. Because my sites are in multiple data centers (US/TX, US/AZ, UK), I don't think I could use a direct-to-database method without a lot of overhead. One advantage to using the database is that you can do monthly, quarterly, and yearly summaries in another database and dump old logs to long-term storage.

Thanks for the confirmation that at least part of my process makes sense. It would be nice if all 9 servers were exactly the same, no matter the hosting company, but since I'm letting the providers manage them, they are just close.

Best, Justin
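PS. For what it's worth, the recombine step in my Bash boils down to something like this (simplified; the real scripts handle errors and a few more naming variants):

#!/bin/sh
# Simplified sketch: unpack one domain's server-tagged pieces into
# numbered temp files, then merge them into one date-ordered log.
DOMAIN=$1
YEAR=$2
SRC=/export/webstats/d/$DOMAIN/$YEAR
N=0
for F in $SRC/*-access_log-$DOMAIN-*.log.gz; do
    N=$((N+1))
    NN=$(printf '%02d' $N)
    gunzip -c "$F" > /tmp/access_log-tmp$NN-$DOMAIN.log
done
# merge the per-server temp files into one sorted log for analysis
mergelog /tmp/access_log-tmp??-$DOMAIN.log > /tmp/$DOMAIN-$YEAR.log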
SmarterStats looks good; however, the licensing jumps from 50 sites straight to 250, and I have 72 active sites. I guess I'll need to run the trial to see if it will work for my needs. Thanks. Best, Justin
You can get deals on SmarterStats and SmarterMail with hosting packages. I got a free 50-site pack and am looking to use it to upgrade to the larger version. PM me if you need to find a place to buy it from. t
Thanks Tloosle. I didn't like the free demo of SmarterStats: it uses Microsoft Silverlight, and it was difficult to extract keywords from groups of sites in an automated fashion. I did like how it used FTP to pull all the files together, automating the data collection process from a single point of management. My back-end development environment also has a number of Solaris and Linux boxes, and it didn't feel like SmarterStats could use those resources to process logs; its remote agents sit at the hosting provider. Best, Justin
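PS. The single-point FTP pull itself is easy enough to replicate from the in-house box. Something like this on a cron schedule would mirror each provider's log directory (hosts, user, password, and paths are placeholders; I haven't put this into production):

#!/bin/sh
# Mirror each provider's log directory to the in-house box over FTP.
# Host names, credentials, and paths below are placeholders.
for HOST in web1.provider-a.example web2.provider-b.example; do
    wget -q -m -nH -P /export/webstats/incoming/$HOST \
        "ftp://statsuser:PASSWORD@$HOST/logs/"
done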
I've settled on a combination of mergelog, rotatelogs, and Webalizer for processing my sites. I've found the 'vcommon' log format the most flexible for pulling stats from my virtual private servers. This solution lets me extract keywords and get basic traffic trends. For commerce sites, I'm adding Google Analytics on the top end for detailed conversion stats. Thanks for all the feedback; it has helped. Best, Justin
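PS. In case anyone wants to copy this, the pieces fit together roughly like so. On each server (the rotatelogs path and rotation window are examples; vcommon is just the common log format with the vhost name, %v, prepended):

# httpd.conf on each VPS
LogFormat "%v %h %l %u %t \"%r\" %>s %b" vcommon
CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/access_log.%Y%m%d 86400" vcommon

And on the stats box, per site (paths and the Webalizer config name are placeholders):

#!/bin/sh
# Strip the leading vhost field so each piece is plain common format,
# merge the pieces in date order, then run Webalizer over the result.
SITE=example.com
YEAR=2005
for F in /export/webstats/d/$SITE/$YEAR/*.log.gz; do
    gunzip -c "$F" | cut -d' ' -f2- > /tmp/$(basename "$F" .gz)
done
mergelog /tmp/*-access_log-$SITE-*.log > /tmp/$SITE-merged.log
webalizer -c /etc/webalizer/$SITE.conf /tmp/$SITE-merged.log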
Hi Justin, Do you have your logs set to process daily or hourly? And what are you currently using (if anything) for just-in-time stats readout, for checking/alerting on odd or off traffic patterns and levels?
Logs are processed several times during the month, depending on which site is being looked at. However, I do things a little differently, looking primarily at how much money each site generates and only really looking at the logs for optimization. If I have a drop-off in revenue, I'll recompile the logs and then look at things in detail for that period. Because data from all the servers wouldn't be available, it doesn't really help to look at stats more than once a month. Plus, I see very little value in loading 100 sites into Google Analytics. Best, Justin
Hi Justin, Typical turnaround times for revenue reports vary, usually a day or more. Using a real-time stats system with goals can potentially save a lot of revenue if there is a technical problem in your goal chain. On top of that, one with an alert system (GoStats, for example) will be able to alert you in a much faster cycle, and automatically. Just some food for thought.