hans
Mar 24th 2004, 9:41 pm
a few months ago - i had some spare time and surfed
and found a simple perl script
much different from what i expected and with a nice side effect.
a few basics - in case there are some newbies reading THIS.
with simple access_log-file analyzing tools such as webalizer you get detailed statistics about
- entry/exit pages
- referrer
- countries visitors are originating from
- traffic details by hits, pages, KB, visits, hosts, ... and much more
- how many hits for each individual page
- error statistics
- ...
and of course
a list of the keywords USED when searching in SE to find your site
that's all very nice + helpful
but
all these data LACK relevancy when it comes to answering
- DID EVERY surfer visiting my site from ANY SE query-result really find the ONE PAGE directly relevant to HIS query
- does the page HE found FULLY answer HIS query ??
that was my key problem - i wanted an answer, a solution ...
and the one who-is-online-script i found was access.pl
the original is at
http://mkruse.netexpress.net/
access.pl was the answer and FULL solution to my quest
i modified it - adapted it to more individual SE to give a better SE overview in the browser output - and more lines of output
( now 777 lines of access_log - that gives me approx 1 hr of access +/- depending on time of day )
a modified version access.zip is available from my site for download at
http://www.kriyayoga.com/logfiles/tools.html
see top of page -- security !
how does it work ?
with the command tail -f
it takes a configurable number of access_log lines
and displays them DIFFERENTLY - here is how
- grouped by IP
- you see ONE IP/surfer as a group - with ALL the pages/files THAT one IP ( surfer ) visited
- at the beginning the referrer ( SE highlighted with language affiliation for
most googles and some others as well - these are easy changes i made and you see how when you open the script with a
Unix compliant text editor, and you can add or modify according to your needs ! )
- hence you see from WHERE each visitor is coming - what exact query words HE used for that search engine AND
you see from the FIRST page he loaded IF he found the full answer to HIS query !
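to make the grouping idea concrete, here is a minimal sketch in Python ( not the perl code of access.pl itself ). it assumes the access_log is in Apache "combined" format, and the function names and the list of query parameter names ( q, query, p ) are my own assumptions for illustration - different SE use different parameter names.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: Apache "combined" log format; referrer and agent are optional here.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def search_query(referrer):
    """Return the search words found in a referrer URL, or '' if there are none."""
    if not referrer or referrer == "-":
        return ""
    params = parse_qs(urlparse(referrer).query)
    for key in ("q", "query", "p"):  # hypothetical list, extend per search engine
        if key in params:
            return params[key][0]
    return ""

def group_by_ip(lines):
    """Group parsed requests by client IP, keeping first-seen order of the IPs."""
    groups = {}
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        groups.setdefault(match.group("ip"), []).append(match.groupdict())
    return groups
```

for each IP the first entry in its group is the first page loaded, so calling search_query() on that entry's referrer puts the query words right next to the landing page.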
now let's assume he does NOT find the full answer or lands on a totally wrong page .. worst scenario ..
why ?
- wrong or missing title / description, or content irrelevant to the description or keywords in meta
- wrong / irrelevant search results from SE
what to do ?
change - correct - adapt your pages according to the NEEDS of surfers / SEO !
provide the fullest possible answer / solution to HIS query and problem in life - whatever service you offer
YOU are the solution FOR your targeted group on this entire planet !!! who else ?
less bad scenario ..
you feel and know he found - but only part of what he asked for - you see his FULL query + maybe ONE or several of the terms are missing on THAT page ..
but may be available ON your site ..
then add links to all or any missing parts that HE can find on YOUR site !
adapt YOUR pages to meet the full need of each or as many of YOUR visitors as possible.
add another chapter with the one answer he may have been missing
or another sentence to clarify one point and to make your results full and complete
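the check "which of HIS query words are missing on THAT page" can also be done mechanically. a small sketch, assuming you already have the visitor's query and the plain text of the landing page as strings ( the function name missing_terms is hypothetical, and the plain word matching ignores plurals and misspellings ):

```python
import re

def missing_terms(query, page_text):
    """Return the query words that do not appear anywhere in page_text."""
    page_words = set(re.findall(r"\w+", page_text.lower()))
    return [word for word in re.findall(r"\w+", query.lower())
            if word not in page_words]
```

every word it returns is a candidate for a link to another page of your site, or for a new sentence or chapter on the page itself.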
some other scenarios ..
SE are still learning and use different algorithms - sometimes the combination of answers they offer simply is wrong - hence
nothing to be done by you - but by the SE.
or
the surfer lacks knowledge about HOW to use SE and how to write a logical all inclusive query that really results in what he NEEDS
or
the surfer lacks knowledge of what he really is looking for !! YES that happens sometimes - some people are just so confused and LOST
in life that they write queries that make no logical sense at all - or - they write a full sentence just like you would ask a human ..
but NEVER a computer at this early stage of PC development.
about these last points you can do little or nothing at all.
but all the above points show you a DIRECT relationship between ONE individual surfer with his full query and all the pages he surfed
and hence you KNOW if he found ALL or part or nothing at all ..
and you can adapt your site - if needed all your pages - to optimize CONTENT
you also see WHERE exactly your 404s ( page not found ) are coming from ..
and can adapt .. correct - or just KNOW that there is nothing to be done in case the URL is wrong or a correct URL
was mistyped by the surfer ... except that in the latter case you can make a custom 404 page with an ON SITE real text search engine you
install to give each surfer another chance to find ON your site what he was looking for.
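to see where the 404s come from without reading the whole log by eye, a sketch like the following can count each ( requested path, referrer ) pair. it assumes Apache "combined" log format again, and the name not_found_sources is only illustrative:

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format; the referrer field may be absent.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)")?'
)

def not_found_sources(lines):
    """Count (requested path, referrer) pairs for every 404 response."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("status") == "404":
            counts[(match.group("path"), match.group("referrer") or "-")] += 1
    return counts
```

a referrer of "-" usually means a typed or pasted URL or a piece of software, while a real referrer URL tells you which page holds the broken link.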
re custom missing.html page ( configured in .htaccess )
every once in a while there is a NEED to "send that page on vacation" such as i do right now for a few months - to allow ALL SE to get a REAL 404 and to remove any and all wrong / outdated / missing pages from THEIR database - i think it's a question
of courtesy toward all SE to do so every once in a while.
it has allowed me to reduce the number of 404 errors from some 3% to less than 1% and this remainder is caused by fancy silicon valley SW requests asking for weird URLs like
/MSOffice/cltreq.asp?UL=1&ACT=4&BUILD=4219&STRMVER=4&CAPREQ=0
or
/icickm2004/ icickm2004-home.htm ( space IN url --->> after 2004/ and icickm ... - illegal URL !! )
OR
/icickm2004/%20icickm2004-home.htm ( space in URL replaced by %20 -- validly encoded, but still a WRONG URL !! )
the latter TWO samples come from G - IF people COPY and PASTE the URL rather than clicking it !
NOTHING to be done for such, and if you see that ALL your 404s are coming from such human errors
by surfers - then you can accept whatever % of 404 you have !
these are all pages that may NEVER ever have been on your site but are requested by some browser software or surfers
other point ..
WHY is this access.pl tool in my security paragraph ?
because of queries like
-------------------- start log file excerpt from access.pl
12.38.79.66 - Mozilla/4.06 (Win95; I)
Date Page Status Referrer
12/28 02:14 /cgi-bin/Mail.pl 302
12/28 02:14 /cgi-bin/FormMail.cgi 200
12/28 02:14 /cgi-bin/formmail.cgi 200
12/28 02:14 /cgi-bin/FormMail.pl 200
12/28 02:14 /cgi-bin/mail.pl 302
12/28 02:14 /cgi-bin/formmail.pl 200
12/28 02:14 /cgi-bin/Mail.cgi 302
12/28 02:14 /cgi-bin/mail.cgi
--------------------- end
here above we see ( i saw it in real time :) )
a hacker's attempt to search for a nonexistent form mail perl script ( probably the one from Matt's perl archive )
in order to abuse such a script for HIS spam ..
other hacker attempts have also been observed a few times
then you may act instantly by BLOCKING that IP in your .htaccess ( takes just a few seconds ) ..
or just watch and smile if you know that your site is safe !
people "sneak" in and search in areas where NO link ever guides them - and you know and observe and may adjust
permissions or access of such web site areas.
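a probe like the one in the excerpt has a simple signature: ONE IP asking for several spellings of the same script within a short time. a rough sketch that flags such IPs, assuming a log whose lines start with the IP and contain the quoted request as in Apache common / combined format ( the watch pattern and the threshold of 3 are arbitrary assumptions you would tune yourself ):

```python
import re
from collections import defaultdict

# Assumption: each line starts with the client IP and contains "METHOD /path ...".
REQUEST_RE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)')

# Hypothetical watch pattern: mail / formmail scripts in /cgi-bin/, any letter case.
PROBE_RE = re.compile(r"/cgi-bin/(form)?mail\.(pl|cgi)(\?|$)", re.IGNORECASE)

def probing_ips(lines, threshold=3):
    """Return {ip: sorted probe paths} for IPs that tried >= threshold distinct paths."""
    hits = defaultdict(set)
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and PROBE_RE.search(match.group("path")):
            hits[match.group("ip")].add(match.group("path"))
    return {ip: sorted(paths) for ip, paths in hits.items()
            if len(paths) >= threshold}
```

the result is only a list of candidates - whether you then block an IP or just watch and smile stays your decision.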
one last point about access.pl
it's just ONE file
you then call THAT full URL of access.pl in your browser and see the output in your browser window.
you may - like i did - use an external style sheet to make it more colorful and easier to instantly READ
and change the number of lines
only 2 lines really NEED to be adapted
one with the FULL absolute path to your access_log file on your server
another one with the full domain name !
that's it
it takes just a very few minutes
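for readers who want to try the same idea outside of perl, the few settings boil down to something like this sketch. the variable names and the path are placeholders of mine, not the names used inside access.pl, and reading the end of the file with a deque only stands in for what the script does with tail:

```python
from collections import deque

# Hypothetical settings, mirroring the things the post says you adapt.
ACCESS_LOG = "/full/absolute/path/to/access_log"  # placeholder path
DOMAIN = "www.example.com"                        # placeholder domain name
MAX_LINES = 777                                   # roughly 1 hr of traffic in the post

def last_lines(path, count=MAX_LINES):
    """Return the last `count` lines of the file at `path` as a list."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the newest `count` lines while reading.
        return list(deque(handle, maxlen=count))
```

a bigger MAX_LINES shows a longer time window but makes the page slower to build, so the right number depends on how busy your site is.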
you may also see what i see re AOL .. each visit from ONE surfer is split into many page / file requests, each one coming from a different IP -
which makes it appear in YOUR log as if MANY visitors use AOL
in reality you may see that ONE AOL surfer may create up to about 20 different page requests originating from different IPs -
making all believe ...
:)
belief is one part of creation - KNOWLEDGE another part
because YOU know the direct relationship between all different files requested for ONE page and hence you will KNOW ...
how small .. and how few ..
:)
access.pl has helped me a lot during the past months
i actually load it as THE VERY first page of each online session
it also shows me instantly the nice SE visits by various bots
and HOW they crawl or LOOP ( i once had a direct observation of ONE German SE-bot looping some 4-5 THOUSAND times ..
and because they are friendly - they include their URL - i emailed them
the next day the problem was FIXED and their bot surfed free of any loop .. )
it is like becoming a guardian angel for bots
and helping when needed and/or possible
or being a guardian angel for surfers
who may use a misspelled word or ONE word where YOU write two, and you adapt or add that joined word - to ensure all spellings
and misspellings related to YOUR site are found.
you also see something interesting about the google bots
how often they get a 304 or a 200
and you see
it appears obvious that the TTL ( time to live ) of their database is configured to BE short ! - because they visit my site daily with up to a few
hundred pages crawled AND ... many times get a 200 - meaning they have DUMPED their previously crawled original and WANT a new
one forced into their cache - because YOU will know that at least a few of those pages fetched with a 200 by Googlebot are still the same
as a few weeks earlier during the previous visit.
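if you want numbers instead of an impression, the 200 / 304 ratio for one bot is easy to count. a sketch assuming Apache "combined" log format with the user agent as the last quoted field ( the function name and the substring match on "Googlebot" are my assumptions, and anyone can fake a user agent ). keep in mind that a server answers 304 only when the request was conditional ( e.g. carried an If-Modified-Since header ) and the page is unchanged, so the ratio shows how the bot ASKED, not only what it has stored:

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format with referrer and user agent present.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def bot_status_counts(lines, agent_substring="Googlebot"):
    """Count HTTP status codes for requests whose user agent contains agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and agent_substring.lower() in match.group("agent").lower():
            counts[match.group("status")] += 1
    return counts
```

comparing the counts over several days shows whether the share of 200 answers on unchanged pages really is as high as it looks in the browser output.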
which also somehow MAY ( MAY BE .. ) explain the google dance -- WHY some pages DROP fully out and a day or a few days later are IN
and top 10 or so again - it may simply be that their goal of FRESHNESS sometimes can NOT be met by their own bot, which fails to RELOAD ( 200 )
that page BEFORE it is dumped ..
it also shows HOW time efficient a google bot CAN be and most of the time IS - a new page often is IN their search database within
less than 24 hrs after publishing WITHOUT submitting the new page - just by adding it to your navigation menus !!!
god bless