Excluding robots from a page counter?

Discussion in 'Site & Server Administration' started by Mat, Nov 16, 2005.

  1. #1
    I'm setting up a banner advertising system on my site. In order to track page impressions properly, I need to implement a counter in PHP which increments with every valid page impression, but does nothing if the page is being spidered by robots.

    How do I determine using PHP whether the page has been requested by a robot or a genuine visitor?

    Cheers,
    Mat
     
    Mat, Nov 16, 2005 IP
  2. Jako

    #2
    I'd like to know this too. Sometimes on my forums all of the robots get counted as active members, etc.
     
    Jako, Nov 16, 2005 IP
  3. just-4-teens

    #3
    Not sure exactly how, but you could use the user_agent string to determine whether it's a spider or not. I know it can be done.
     
    just-4-teens, Nov 16, 2005 IP
  4. Mat

    #4
    I've been digging around a lot, and it seems there's no quick and easy way to identify a robot.

    If you need to identify robots in real time on your page then, as far as I can see, there are two ways:

    1. Maintain your own pageview logs in a database. Every time a page loads, log the request in a database table along with the user-agent. You then monitor the various user-agents (not too hard, there aren't a great many) and maintain a table of "blacklisted" user-agents which you have identified as robots. Then, when a page loads, you check the user-agent against your blacklist and decide whether or not to increment your page counter. This is my preferred method.

    2. Use the get_browser() function in PHP, which returns a large object with a property identifying whether the user-agent is a robot/crawler. However, this depends on maintaining an up-to-date browscap.ini file, which you may not have access to on a shared server.
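
Option 1 above could look roughly like the sketch below. To keep it self-contained, plain arrays stand in for the database tables. The function names (`isBlacklisted`, `recordImpression`) and the sample user-agent strings are illustrative assumptions, not part of any existing system.

```php
<?php
// Sketch only: arrays stand in for the "pageview log" and "blacklist" tables.

/**
 * Returns true if the exact user-agent string is on the blacklist.
 * The blacklist is keyed by user-agent, so the lookup is a hash check.
 */
function isBlacklisted(string $userAgent, array $blacklist): bool
{
    return isset($blacklist[$userAgent]);
}

/**
 * Logs every request, but only increments the counter for
 * user-agents that are not blacklisted. Returns the updated counter.
 */
function recordImpression(string $userAgent, array $blacklist, array &$log, int $counter): int
{
    $log[] = $userAgent;      // always log, so new robots can be spotted later
    if (isBlacklisted($userAgent, $blacklist)) {
        return $counter;      // robot: leave the counter alone
    }
    return $counter + 1;      // genuine visitor: count the impression
}

// Sample blacklist entries (assumed values, as they might be copied from the log).
$blacklist = [
    'Googlebot/2.1 (+http://www.google.com/bot.html)' => true,
    'msnbot/1.0 (+http://search.msn.com/msnbot.htm)'  => true,
];

$log = [];
$counter = 0;
$counter = recordImpression('Googlebot/2.1 (+http://www.google.com/bot.html)', $blacklist, $log, $counter);
$counter = recordImpression('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)', $blacklist, $log, $counter);
```

On a live page the first argument would come from `$_SERVER['HTTP_USER_AGENT']`. Exact-string matching only catches user-agents already seen and blacklisted, which is why every request is logged.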

    Anyone have any better ideas?

    Cheers,
    Mat
     
    Mat, Nov 17, 2005 IP
  5. just-4-teens

    #5
    I'm no expert at PHP, in fact I know hardly anything, so the code below probably means nothing, but could you do this?
    
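    <?php
    // Read the user-agent header; fall back to an empty string if it is missing.
    $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Googlebot identifies itself with a string such as "Googlebot/2.1".
    if (stripos($user_agent, 'googlebot') !== false) {
        // add whatever is needed to not count the hit here
    }
    ?>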
     
    just-4-teens, Nov 17, 2005 IP
  6. Mat

    #6
    Yes, this is the general idea. Unfortunately there are hundreds of different user-agents for robots, which is why we need some sort of lookup (either a database as I suggested, or a text file as used by the get_browser() function). I'm not trying to count hits for a specific robot; I'm trying to count hits when the user-agent is NOT a robot, in order to get more accurate stats for banner impressions.
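
For the text-file lookup, a minimal sketch of the get_browser() route might look like this. It assumes the `browscap` ini directive points at a current browscap.ini that defines a `crawler` property; `crawlerFlag` and `isRobotByBrowscap` are hypothetical helper names.

```php
<?php
/**
 * Interprets the array returned by get_browser($ua, true).
 * Returns null when no browscap data is available, otherwise
 * true/false according to the "crawler" property.
 */
function crawlerFlag($info): ?bool
{
    if (!is_array($info) || !array_key_exists('crawler', $info)) {
        return null;          // no data: the caller must decide what to do
    }
    // The flag may arrive as "1", "", "true" or "false" depending on parsing.
    return filter_var($info['crawler'], FILTER_VALIDATE_BOOLEAN);
}

/**
 * Asks get_browser() about a user-agent, but only if the "browscap"
 * ini directive is set; without it get_browser() has nothing to look up.
 */
function isRobotByBrowscap(string $userAgent): ?bool
{
    if (!ini_get('browscap')) {
        return null;          // browscap.ini is not configured on this server
    }
    return crawlerFlag(@get_browser($userAgent, true));
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isRobot = isRobotByBrowscap($ua);   // true, false, or null if unknown
```

Returning null rather than false when browscap is unavailable lets the counter code fall back to another check instead of silently treating every request as a genuine visitor.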

    Mat
     
    Mat, Nov 17, 2005 IP
  7. just-4-teens

    #7
    How about going through your server logs?

    If you download a copy of awstats (a stats program), it will have an up-to-date list of all the search engine robots.
     
    just-4-teens, Nov 17, 2005 IP
  8. Mat

    #8
    I already use Urchin - which does fairly good stats.

    I'm not familiar with awstats, but most stats programs (including Urchin) are not real-time (they update overnight). I'm talking about tracking real-time stats here. In order to properly manage my ad banner system I need real-time stats, and to be able to determine browser/robot in real time.

    Cheers,
    Mat
     
    Mat, Nov 17, 2005 IP
  9. just-4-teens

    #9
    What I meant was to search the awstats source code for its list of search engine user_agent strings. Then you have a pre-built list (up to date, if you use the latest version) of which user_agent is which.
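
Whatever the source of the list, matching could be done against short signature substrings rather than full user-agent strings. The sketch below builds one case-insensitive regular expression from such a list. The signatures shown are a few well-known examples typed in by hand, not the awstats list itself, and `buildRobotPattern` and `matchesRobot` are hypothetical helper names.

```php
<?php
/**
 * Combines robot signature substrings into a single case-insensitive regex.
 * An empty list yields a pattern that never matches.
 */
function buildRobotPattern(array $signatures): string
{
    if (count($signatures) === 0) {
        return '/(?!)/';      // empty negative lookahead: matches nothing
    }
    $escaped = array_map(function ($s) {
        return preg_quote($s, '/');   // escape regex metacharacters and the delimiter
    }, $signatures);
    return '/' . implode('|', $escaped) . '/i';
}

/** Returns true if the user-agent contains any of the signatures. */
function matchesRobot(string $userAgent, string $pattern): bool
{
    return preg_match($pattern, $userAgent) === 1;
}

// Hand-picked example signatures (an assumption, not a complete list).
$signatures = ['googlebot', 'msnbot', 'slurp', 'crawler', 'spider'];
$pattern = buildRobotPattern($signatures);
```

Building the pattern once and reusing it keeps the per-request cost to a single `preg_match` call, however long the list grows.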
     
    just-4-teens, Nov 17, 2005 IP
  10. skattabrain

    #10
    You know, you can always look for the operating system too. It seems there are only three you really need to know.

    It's not 100%, but it's mostly either MS, Mac or Linux.

    Quick and dirty: I use this to show Flash navigation for real people, and plain HTML for bots/odd browsers.

    
    
    <?php
    $client = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    // trackit() is my own tracking function (not shown here).
    if (strstr($client, "Windows") || strstr($client, "Macintosh")) { trackit(); }
    ?>

    Not sure what Linux would show up as, but if the user-agent contains Windows or Macintosh, it fires.
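
A slightly fuller sketch of the same whitelist idea follows. The `Linux` and `X11` tokens are an assumption about how Linux desktop browsers usually identify themselves, and `looksLikeDesktopBrowser` is a hypothetical name. It is still not 100% reliable, since any client can send any user-agent string.

```php
<?php
/**
 * Returns true if the user-agent mentions a common desktop operating system.
 * The check is case-insensitive and purely substring based.
 */
function looksLikeDesktopBrowser(string $userAgent): bool
{
    // "Linux" and "X11" are assumed tokens for Linux desktops.
    $osTokens = ['Windows', 'Macintosh', 'Linux', 'X11'];
    foreach ($osTokens as $token) {
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;             // no known OS token: treat as bot or odd browser
}
```

Because this is a whitelist, unknown clients are left uncounted by default, which errs on the side of under-reporting impressions rather than inflating them.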
     
    skattabrain, Nov 17, 2005 IP