A few months ago I had some spare time, surfed around, and found a simple Perl script, much different from what I expected and with a nice side effect.

A few basics first, in case there are newbies reading THIS. With simple access_log analyzing tools such as Webalizer you get detailed statistics about:
- entry/exit pages
- referrers
- the countries visitors originate from
- traffic details by hits, pages, KB, visits, hosts, and much more
- how many hits each individual page received
- error statistics
- and of course a list of the keywords USED in search engines to find your site.

That is all very nice and helpful, but all these data LACK relevancy when it comes to answering:
- did EVERY surfer visiting my site from ANY SE query result really find the ONE page directly relevant to HIS query?
- does the page HE found FULLY answer HIS query?

That was my key problem; I wanted an answer, a solution. And the who-is-online script I found was access.pl. The original is at http://mkruse.netexpress.net/ . access.pl was the answer and FULL solution to my quest. I modified it: adapted it to more individual search engines to give a better SE overview in the browser output, and more lines of output (now 777 lines of access_log, which gives me approx 1 hr of access, more or less depending on the time of day). A modified version, access.zip, is available for download from my site at http://www.kriyayoga.com/logfiles/tools.html (see top of page, under security!).

How does it work? With the command tail it takes a configurable number of access_log lines and displays them DIFFERENTLY (a minimal sketch of the core idea follows below):
- grouped by IP: you see ONE IP (one surfer) as a group, with ALL the pages/files that one IP visited
- at the beginning of each group, the referrer, with the search engine highlighted and its language affiliation shown for most Googles and some others as well (these are easy changes I made; you will see how when you open the script with a Unix-compliant text editor, and you can add or modify them according to your needs!).

Hence you see from WHERE each visitor is coming, what exact query words HE used in that search engine, AND you see from the FIRST page he loaded whether he found the full answer to HIS query!

Now let's assume he does NOT find the full answer, or lands on a totally wrong page. Worst scenario. Why?
- wrong or missing title/description, or content irrelevant to the description or the keywords in the meta tags
- wrong or irrelevant search results from the SE.

What to do? Change, correct, and adapt your pages according to the NEEDS of surfers. Provide the fullest possible answer and solution to HIS query and problem in life. Whatever service you offer, YOU are the solution FOR your targeted group on this entire planet! Who else?

Less bad scenario: you feel and know he found something, but only part of his query. You see his FULL query, and maybe ONE or several of the terms are missing on THAT page but available elsewhere ON your site. Then add links to all the missing parts that HE can find on YOUR site! Adapt YOUR pages to meet the full need of each visitor, or as many of YOUR visitors as possible: another chapter with the one answer he may have been missing, another sentence to clarify one point and make your results full and complete.

Some other scenarios: SE are still learning and have different algorithms, so sometimes the combination of answers they offer is simply wrong. Nothing to be done by you there, only by the SE. Or the surfer lacks knowledge about HOW to use a SE and how to write a logical, all-inclusive query that really results in what he NEEDS. Or the surfer lacks knowledge of what he really is looking for!! YES, that happens sometimes: some people are so confused and LOST in life that they write queries that make no logical sense at all, or they write a full sentence just as you would ask a human, but NEVER a computer at this early stage of PC development. About these last points you can do little or nothing at all.
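To make the mechanics concrete, here is a minimal sketch of the core idea behind such a script: take the tail of the access_log, group the hits by IP, and pull the query terms out of a search engine referrer. This is only my illustration of the technique, NOT code from access.pl itself; the log path, the combined log format, and the Google-only referrer matching are assumptions you would adapt.

-------------------- start sketch
#!/usr/bin/perl
# Minimal sketch of the who-is-online idea: group the last N access_log
# lines by IP and show the search query hidden in the referrer.
# Illustration only, not code from access.pl; adjust $log to your server.
use strict;
use warnings;

my $log   = '/var/log/apache/access_log';   # full absolute path to your log
my $lines = 777;                            # approx 1 hr of traffic on my site

my %visits;    # ip => list of [status, page, referrer]
open my $fh, '-|', 'tail', '-n', $lines, $log or die "cannot tail $log: $!";
while (<$fh>) {
    # combined log format: ip ... "METHOD page HTTP/x.x" status bytes "referrer" "agent"
    my ($ip, $page, $status, $ref) =
        /^(\S+) .*? "\S+ (\S+) [^"]*" (\d{3}) \S+ "([^"]*)"/ or next;
    push @{ $visits{$ip} }, [$status, $page, $ref];
}
close $fh;

for my $ip (sort keys %visits) {
    print "$ip\n";
    for my $hit (@{ $visits{$ip} }) {
        my ($status, $page, $ref) = @$hit;
        # highlight a search engine referrer and extract its query terms
        if ($ref =~ m{^https?://(?:www\.)?google\.[^/]+/.*[?&]q=([^&]*)}) {
            my $q = $1;
            $q =~ tr/+/ /;                              # '+' encodes a space
            $q =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # undo %XX escaping
            $ref = "GOOGLE query: $q";
        }
        print "  $status $page $ref\n";
    }
}
--------------------- end

The real access.pl does the grouping and the SE highlighting far more completely, but those few lines are the whole trick.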
But all the points above show you a DIRECT relationship between ONE individual surfer, with his full query, and all the pages he surfed. Hence you KNOW whether he found ALL, or part, or nothing at all, and you can adapt your site, all your pages, to optimize CONTENT.

You also see WHERE exactly your 404s (page not found) are coming from, and can adapt and correct, or just KNOW that there is nothing to be done when the URL in the referrer is wrong or a correct URL was mistyped by the surfer. Except that in the latter case you can make a custom 404 page with a real ON-SITE text search engine you install, to give each surfer another chance to find ON your site what he was looking for.

Regarding the custom missing.html page (configured in .htaccess): every once in a while there is a NEED to "send that page on vacation", as I am doing right now for a few months, to allow ALL SE to get a REAL 404 and to remove any and all wrong, outdated, or missing pages from THEIR databases. I think it is a question of courtesy toward all SE to do so every once in a while. It has allowed me to reduce the number of 404 errors from some 3% to less than 1%, and this remainder is caused by fancy Silicon Valley software requesting weird URLs like

/MSOffice/cltreq.asp?UL=1&ACT=4&BUILD=4219&STRMVER=4&CAPREQ=0

or

/icickm2004/ icickm2004-home.htm (a space IN the URL, after 2004/ and before icickm: an illegal URL!)

or

/icickm2004/%20icickm2004-home.htm (the space in the URL replaced by %20: still an ILLEGAL URL!)

The latter two samples come from G, when people COPY and PASTE the URL rather than clicking it! NOTHING to be done about such requests, and if you see that ALL your 404s are coming from such human errors by surfers, then you can accept whatever % of 404s you have. These are all pages that may NEVER have been on your site but are requested anyway by some browser software or by surfers.

Another point: WHY is this access.pl tool in my security paragraph? Because of requests like these:

-------------------- start log file excerpt from access.pl
12.38.79.66 - Mozilla/4.06 (Win95; I)
Date Page Status Referrer
12/28 02:14 /cgi-bin/Mail.pl 302
12/28 02:14 /cgi-bin/FormMail.cgi 200
12/28 02:14 /cgi-bin/formmail.cgi 200
12/28 02:14 /cgi-bin/FormMail.pl 200
12/28 02:14 /cgi-bin/mail.pl 302
12/28 02:14 /cgi-bin/formmail.pl 200
12/28 02:14 /cgi-bin/Mail.cgi 302
12/28 02:14 /cgi-bin/mail.cgi
--------------------- end

Here above we see (I saw it in real time) a hacker attempting to find a nonexistent form-mail Perl script (probably the one from Matt's Script Archive), to abuse such a script for HIS spam. Other hacker attempts have also been observed a few times. You may then act instantly by BLOCKING that IP in your .htaccess (it takes just a few seconds), or just watch and smile if you know that your site is safe! People "sneak" in and search in areas where NO link ever guides them, and you know it, observe it, and may adjust permissions or access for such areas of your web site.
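For anyone who wants to automate the watching, here is a rough sketch of how such a FormMail scan could be spotted in a raw access_log and turned into ready-to-paste .htaccess deny lines. It is NOT part of access.pl; the log path, the probe pattern, and the threshold of 3 are my own assumptions.

-------------------- start sketch
#!/usr/bin/perl
# Sketch: scan an access_log for FormMail probe requests and print
# .htaccess deny lines for the probing IPs. Not part of access.pl;
# log path, probe pattern, and threshold are illustrative assumptions.
use strict;
use warnings;

my $log = shift @ARGV || '/var/log/apache/access_log';
my %suspects;

open my $fh, '<', $log or die "cannot open $log: $!";
while (<$fh>) {
    # first field of a common/combined log line is the client IP
    my ($ip, $request) = /^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)"/ or next;
    # the classic probe: many case variants of formmail/mail scripts
    $suspects{$ip}++ if $request =~ m{/cgi-bin/(?:form)?mail\.(?:pl|cgi)}i;
}
close $fh;

for my $ip (sort keys %suspects) {
    next if $suspects{$ip} < 3;   # several variants tried = a scan, not a typo
    print "deny from $ip\n";      # paste into .htaccess (needs "order allow,deny")
}
--------------------- end

Run it against the day's log and you get a short deny list for your .htaccess.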
One last point about access.pl: it is just ONE file. You call the full URL of access.pl in your browser and see the output right in your browser window. You may, as I did, use an external style sheet to make it more colorful and easier to instantly READ, and change the number of lines. Only 2 lines really NEED to be adapted: one with the FULL absolute path to your access_log file on your server, the other with your full domain name! That's it; it takes just a very few minutes.

You may also see what I see regarding AOL: each visit from ONE surfer is split into many page/file requests, each one coming from a different IP, which makes it appear in YOUR log as if MANY visitors use AOL. In reality you may see that ONE AOL surfer can create up to about 20 different page requests originating from different IPs, making everyone believe... Belief is one part of creation, KNOWLEDGE another. Because YOU know the direct relationship between all the different files requested for ONE page, you will KNOW how small and how few they really are.

access.pl helped me a lot during the past months; I actually load it as THE very first page of each online session. It also shows me instantly the nice SE visits by various bots and HOW they crawl, or LOOP. I once had a direct observation of ONE German SE bot looping some 4-5 THOUSAND times, and because they are friendly and publish their URL, I emailed them; the other day the problem was FIXED and their bot crawled free of any loop. It is like becoming a guardian angel for bots, to help when needed and/or possible. Or a guardian angel of surfers who may use a misspelled word, or ONE word where YOU write two; you adapt or add that joined word, to assure that all spellings and misspellings related to YOUR site are found.

You also see something interesting about the Google bots: how often they get a 304 or a 200. It appears obvious that the TTL (time to live) of their database is configured to BE short, because they visit my site daily, crawl up to a few hundred pages, AND many times get a 200, meaning they have DUMPED their previously crawled original and WANT a new one forced into their cache; YOU will know that at least a few of those 200 pages called by Googlebot are still the same as a few weeks earlier, during the previous visit. Which also somehow MAY (maybe...) explain WHY Google dances: why some pages DROP fully out and a day or a few days later are IN again, top 10 or so. It may simply be that their goal of FRESHNESS sometimes can NOT be met by their own bot, which fails to RELOAD (200) that dumped page BEFORE it is dropped. It also shows HOW time-efficient a Google bot CAN be, and most of the time IS: a new page is often IN their search database within less than 24 hrs after publishing, WITHOUT submitting the new page, just by adding it to your navigation menus!!!

god bless
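PS: for anyone installing it, those 2 lines are just plain variables near the top of the script. Roughly like this (the variable names here are illustrative, not necessarily the ones in the script; check your own copy):

-------------------- start sketch
# near the top of access.pl: the only two lines that must be adapted.
# variable names are illustrative; check your own copy of the script.
my $access_log = '/home/youraccount/logs/access_log';  # full absolute path on your server
my $domain     = 'www.example.com';                    # your full domain name

# optional: the number of tail lines to display (777 gives me approx 1 hr)
my $lines = 777;
--------------------- end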
Hi Hans, nice tool, I will have a look at it. Here is one I use, as nothing beats on-line realtime. It's an adapted script from MnMDesigns: it shows the visits of the last 4 minutes, from where to where, and whether direct or, in detail, where from. Clickable, so I can immediately check the SE or source and the page they see. I run a couple of versions of it for high-traffic domains and spend a bit of time checking whether searches resolve to the right pages, what the positions in various SEs are, and more specifically what curious query strings users come up with. If I have the time and see curious requests coming in, I sometimes just ban the IP for 24h.

Regarding the hoovers and mail slurpers, most of my sites run various poison cgi scripts. These will try to inject 100+ invalid e-mails if one comes along, as they lead it along a range of domains. It's nice, when one has a break or is on the phone, to watch what is going on, and it has helped a great deal to refine things and be more relaxed about the whole thing.

http://www.mnm-designs.com/main.php

Cheers M
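PS for anyone who wants to roll their own: the poison idea is simply to serve a harvester an endless supply of fake addresses plus links that lead back into more of the same. A rough sketch (NOT my actual script, and every name in it is invented):

-------------------- start sketch
#!/usr/bin/perl
# poison.cgi: feed e-mail harvesters bogus addresses and circular links.
# A sketch of the general technique only, not the script linked above.
use strict;
use warnings;

print "Content-type: text/html\n\n";
print "<html><body>\n";

# a page full of invalid addresses for the harvester to slurp
for (1 .. 100) {
    my $user   = join '', map { ('a'..'z')[rand 26] } 1 .. 8;
    my $domain = join '', map { ('a'..'z')[rand 26] } 1 .. 10;
    print qq{<a href="mailto:$user\@$domain.invalid">$user\@$domain.invalid</a><br>\n};
}

# links that lead the bot along to more of the same
for my $n (1 .. 5) {
    print qq{<a href="poison.cgi?page=$n">more</a>\n};
}

print "</body></html>\n";
--------------------- end

A real one spreads the bait over a range of domains and throttles itself, but that is the whole principle.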
Yes, yours may be better for high-traffic domains (last 4 mins). I adjusted my number of access_log lines to approx 1 hr, so I am free to work and have a look every once in a while. All those tools help to fine-tune a domain and hence to feel at peace when taking off for a while. I also once used a mobile phone to check my domain from anywhere (Nokia 3650).

To make life easier, I just uploaded a very recent view of my last 777 lines to give you an exact idea of how it looks:

http://www.kriyayoga.com/logfiles/access.html

Near the top of the page we see 2 errors 404 created by /index.html%20/%20_top, a copy and paste of a URL that included spaces, straight from the search engine result page. I will remove that page after a few days or a week; in the meantime, may it help your decision making if you are unsure whether to install and try, or leave it.

Peace of mind is important to enjoy life! have fun
Hi Hans, looks OK, I will check it out soon; maybe a nice one for some of my clients. Regarding the on-line window, you can adjust it to whatever you like: it comes standard with 5 minutes, but it's easy to change to 60 min, set how many lines to show, etc. The nice thing is it uses MySQL and cleans up after itself, so you don't build a huge db. I run it on a PHP screen that refreshes every 3 minutes. Oh, and it does some crude but nice predictions and so on.

Cheers M

PS one of the few free tools that actually install cleanly and work out of the box (well, after adjusting the cfg file).
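PPS the clean-up-after-itself part is worth copying in any homegrown logger: prune everything older than the display window on each write, so the table never grows. Roughly like this, sketched with Perl/DBI since the thread is Perl anyway (the tool itself is PHP, and the table/column names here are invented):

-------------------- start sketch
#!/usr/bin/perl
# Sketch: keep a who-is-online table pruned to its display window.
# Table and column names are invented; the real tool is PHP + MySQL.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=stats', 'user', 'password',
                       { RaiseError => 1 });

# record the current hit ...
$dbh->do('INSERT INTO visitors (ip, url, referrer, ts) VALUES (?, ?, ?, NOW())',
         undef, $ENV{REMOTE_ADDR}, $ENV{REQUEST_URI}, $ENV{HTTP_REFERER} || '');

# ... and immediately drop anything older than the window (here 60 minutes),
# so the table never holds more than the screen will display
$dbh->do('DELETE FROM visitors WHERE ts < NOW() - INTERVAL 60 MINUTE');

$dbh->disconnect;
--------------------- end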
Using it on all my sites now, thanks for posting Hans. Very handy to catch rogue bots and track popular keywords used to find pages.
Happy to see you like it and find it useful; so am I. Again happy to have found someone to adapt it for me, for the modified access_log of my 1and1.com hosting. I was missing it for the past weeks since my move to 1and1 hosting; now, since a few days, I have it again. So if anyone else has the same hosting as I have now, there is a version of access.pl available that can handle the additional 2 log format fields. The very same applies to the well-known Webalizer for access stats: a friend from India, a C/C++ developer, has modified Webalizer too, to make it work again with the modified access_log format of 1and1 hosting.
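If you need to adapt a log parser to some host's nonstandard format yourself, the robust way is to anchor on the standard fields you need and tolerate whatever extras follow. A sketch of that idea (I am not listing the actual 1and1 fields here; this simply ignores anything after the user-agent):

-------------------- start sketch
#!/usr/bin/perl
# Sketch: parse the seven standard combined-log fields and tolerate any
# extra trailing fields a host may append. Not the actual 1and1 adaptation.
use strict;
use warnings;

my $log = shift @ARGV or die "usage: $0 access_log\n";
open my $fh, '<', $log or die "cannot open $log: $!";

while (my $line = <$fh>) {
    # ip ident user [time] "request" status bytes "referrer" "agent" ...extras...
    my ($ip, $ts, $request, $status, $bytes, $referrer, $agent) =
        $line =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/
            or next;
    print "$ip $status $request\n";   # stand-in for the real per-IP grouping
}
close $fh;
--------------------- end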
Sorry guys for the outdated link, but this was a truly old thread from 2004; the link is corrected now. Currently I have no time to run the script on my servers: I simply have too much traffic and too much other work to do. But I used the original access.pl, and the 1and1 version, for years. The script is no longer maintained but should still work on most servers. The newer version, "whoisonline", I never had the time nor the need to test, since the original worked perfectly for me.

Good luck