
identifying crawlers, avoiding sessions

Discussion in 'Programming' started by wolfpack, Mar 20, 2005.

  1. #1
    This seems like an obvious thing, I would have expected a post with a direct answer to this question, since it seems to come up a lot... but since I haven't seen one, here goes.

    On a large website (>10,000 pages) we want to keep track of the following:
    - Did they come from an affiliate?
    - Do they have anything in their shopping cart?
    - What path through our site did they take?

    I've read a number of pros and cons, and using PHP's built-in session support looks like the best way to do it. Including auto-rewriting the URLs with session IDs when cookies aren't available.

    BUT, session IDs aren't good when it comes to search engines.

    So, I won't use sessions at all if the browser is a crawler:

    
        $br = get_browser();
        if(!$br->crawler)
        {
            session_start();
            ...
    
    Code (markup):
    which works. BUT, it's been pointed out that get_browser() is surprisingly expensive, and that appears to be the case. Especially since I'm using it solely to identify crawlers. And doing so on every page. On a site that gets a goodly amount of hits. SO....

    Has anyone rolled their own? (Way to avoid sessions with spiders, that is.) How did you do it? Is there a better way I haven't thought of yet?

    Thank you.
     
    wolfpack, Mar 20, 2005 IP
  2. noppid

    #2
    $_SERVER['HTTP_USER_AGENT'];

    The spiders you want to track reliably provide a user agent. No need to pull that whole object that get_browser() returns.
     
    noppid, Mar 20, 2005 IP
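    A minimal sketch of noppid's suggestion. The signature list here is an assumption for illustration; in practice you'd grow it from the bots you actually see in your logs.

    ```php
    <?php
    // Check the raw user agent against a short, hand-picked list of
    // crawler signatures (this list is an assumption, not canonical).
    function is_crawler($ua)
    {
        $signatures = array('googlebot', 'slurp', 'msnbot', 'teoma', 'spider', 'crawler');
        $ua = strtolower($ua);
        foreach ($signatures as $sig) {
            // strict !== so a match at position 0 isn't treated as "not found"
            if (strpos($ua, $sig) !== false)
                return true;
        }
        return false;
    }

    // Usage: only start a session for real browsers.
    // if (!is_crawler($_SERVER['HTTP_USER_AGENT'])) session_start();
    ```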
  3. J.D.

    #3
    If crawlers maintain session cookies, creating sessions for them shouldn't be much of a problem, as long as their hits fall within the session timeout. That is, a single session created once a day per crawler is definitely less stress on the web server than matching user agent strings on every hit.

    I sent a question to Google whether they do or do not maintain session cookies and will update everyone if I hear anything from them.

    J.D.
     
    J.D., Mar 20, 2005 IP
  4. noppid

    #4
    Please explain this "stress" :confused:
     
    noppid, Mar 20, 2005 IP
  5. J.D.

    #5
    Isn't it rather obvious - 5-10 strpos($ua, $a_ua) calls for every PHP hit will have to use more CPU cycles than a single session created once an hour?

    J.D.
     
    J.D., Mar 20, 2005 IP
  6. wolfpack

    #6
    But not one that says, "I'm a crawler". And I don't have a list. (Well, I do -- browscap.ini.)

    I don't think there are any search engines that do. Google used not to, and their webmaster documentation implies that they still don't:

    http://www.google.com/webmasters/guidelines.html
     
    wolfpack, Mar 20, 2005 IP
  7. noppid

    #7
    No, it was not obvious at all. I had never seen your code, and we were discussing the overhead of get_browser(). :)

    But now that you point it out, I see your concern.
     
    noppid, Mar 20, 2005 IP
  8. J.D.

    #8
    That's why I said "if crawlers maintain session cookies..." :)

    I saw their blurb, but I still find it hard to believe that they would make such a bad design decision. Cookies are a part of the HTTP standard now, after all.

    J.D.
     
    J.D., Mar 20, 2005 IP
  9. wolfpack

    #9
    J.D., your comments gave me an idea. Here's my current solution:

    
        session_start();
    
        if (!isset($_SESSION["isCrawler"]))
            $_SESSION["isCrawler"] = get_browser()->crawler;
    	
        if($_SESSION["isCrawler"])
            ini_set('url_rewriter.tags', '');    # another way to suppress url re-writing
        else
        {
            ...
    
    Code (markup):
    (Re-edit of this post)
    Oops, that won't work. I'm assuming I have a session, but I'm suppressing it. Oh well, back to the blackboard....
     
    wolfpack, Mar 20, 2005 IP
  10. J.D.

    #10
    I guess I should've been more specific :) I ran a quick test with strpos, and 10 strpos calls take about 100 microseconds on my substandard test machine. That definitely can't be considered stress in the general meaning of the word. However, on a busy website those 100 microseconds per hit may add up to dozens of milliseconds, because string comparisons directly consume CPU cycles.

    That being said, though, it is also possible to write code that will check if the user agent is a crawler *only* if there's no session information provided. In this case these string comparisons, whichever way they are implemented, won't be executed on every hit. If crawlers don't handle sessions, this is how I would approach this.

    J.D.
     
    J.D., Mar 20, 2005 IP
  11. J.D.

    #11
    I was typing and didn't see your post :) Yep, that's what I was thinking about, except that you'd need to check if there's *any* session associated with the current request and if there isn't, check if the user agent is a crawler.

    J.D.
     
    J.D., Mar 20, 2005 IP
  12. wolfpack

    #12
    OK, now I'm confused. If it is a crawler, I don't have a session (I suppressed passing the session ID from the previous page). If I don't have a session, I have to check to see if it's a crawler. So I'm back to checking on every page request.

    Unless you mean, this is where the "if they support session cookies" comes into play...
     
    wolfpack, Mar 20, 2005 IP
  13. J.D.

    #13
    Here's what I was thinking about (think of non-FireFox browsers as crawlers :) ):

    if(!empty($_COOKIE["PHPSESSID"]))
            session_start();
    else if(strpos($_SERVER["HTTP_USER_AGENT"], "Firefox") !== false)
            session_start();
    PHP:
    In this case the string lookup will be performed only for new sessions and for connections that don't support session cookies. Of course, one can argue that a cookie lookup is still a lookup, but I would expect cookies to be stored in some kind of hash table and to be quite quick to find regardless of the number of cookies.

    J.D.
     
    J.D., Mar 20, 2005 IP
  14. wolfpack

    wolfpack Peon

    Messages:
    34
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #14
    OK, bear with me I'm still kind of new to working with cookies. Are you saying that, even *prior* to my starting a session for this page, $_COOKIE["PHPSESSID"] would exist if there was previously a session for this user?

    Incidentally,

    I just wanted to point out here, the fact that you used a stand-in for the test of whether or not this UA is a crawler is the very crux of my original dilemma...
     
    wolfpack, Mar 20, 2005 IP
  15. noppid

    noppid gunnin' for the quota

    Messages:
    4,246
    Likes Received:
    232
    Best Answers:
    0
    Trophy Points:
    135
    #15
    Assuming your check is after the session cookie has been set when something hits your page.

    Try and read the cookie
    If not available, assume crawler
    if available, regular user

    Of course, the possibility of false positives exists, with so many folks blocking cookies these days.
     
    noppid, Mar 20, 2005 IP
  16. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Not on the first hit:

    1. browser > server; no cookie; server looks up the user agent and calls session_start() if it's not a crawler;

    2. server > browser; PHPSESSID is returned;

    3a. browser > server; sends PHPSESSID; server sees that PHPSESSID has been set and does not look up the user agent; calls session_start() to resume the session;

    3b. crawler > server; no cookie; server repeats step 1.

    Since the session is maintained mostly using cookies (search args can be used as well), knowing that the cookie is in the request lets the server avoid a supposedly more expensive lookup (I expect that calling, say, 10+ strpos requires more CPU cycles than a single cookie lookup).

    If you just need a few major crawlers, though, then it's probably easier just to scan the user agent for strings like "googlebot".

    J.D.
     
    J.D., Mar 20, 2005 IP
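    The steps above can be sketched as a pure function, so the decision logic is testable apart from session_start(). Here is_crawler() is just a placeholder stand-in for whatever user-agent test you settle on:

    ```php
    <?php
    // Placeholder crawler test -- an assumption, not a complete list.
    function is_crawler($ua)
    {
        return stripos($ua, 'googlebot') !== false;
    }

    // Cookie-first flow: returning visitors (step 3a) skip the user-agent
    // lookup entirely; only cookie-less requests fall through to it.
    function should_start_session($cookies, $ua)
    {
        if (!empty($cookies['PHPSESSID']))
            return true;            // step 3a: resume existing session
        return !is_crawler($ua);    // step 1 (new browser) vs. step 3b (crawler)
    }

    // Usage:
    // if (should_start_session($_COOKIE, $_SERVER['HTTP_USER_AGENT']))
    //     session_start();
    ```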
  17. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #17
    It is true, although I'm hoping that over time people will do less of that with regards to session cookies, since these are never stored on the machine and don't really pose any security risks (well, no more than any other HTTP headers). Some sites use search args in this case to maintain the session.

    J.D.
     
    J.D., Mar 20, 2005 IP
  18. wolfpack

    wolfpack Peon

    Messages:
    34
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #18
    OK, I'll try it. Thank you.
     
    wolfpack, Mar 20, 2005 IP
  19. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #19
    Oh, yeah - missed this one. I don't use PHP much and can't say what would be a good lookup function to call. In general, for a small number of agents, 5-10 or so, I would just write a function using strcasecmp() or stripos(). The former will run faster, but only works on complete strings (i.e. it will return after comparing only the first few characters for most requests). The latter will take longer, but may be used to find sub-strings, such as googlebot.

    J.D.
     
    J.D., Mar 20, 2005 IP
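    A quick illustration of the difference J.D. describes, using a sample Googlebot user agent (the UA string here is just an example):

    ```php
    <?php
    $ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

    // Whole-string compare: only matches if the UA is exactly "googlebot",
    // so it returns non-zero here.
    $whole = (strcasecmp($ua, 'googlebot') === 0);

    // Substring search: finds "googlebot" anywhere in the UA string.
    $sub = (stripos($ua, 'googlebot') !== false);
    ```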
  20. wolfpack

    wolfpack Peon

    Messages:
    34
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #20
    OK, here's what I'm using now. If $_COOKIE["PHPSESSID"] exists, then sessions are ok to use. Otherwise I look up the user agents in my growing list to see if it's a crawler. If I don't find the user agent there, then I use get_browser(). This call reads the 240k browscap.ini file, so I only do it as a last resort. But it does give me the canonical answer. Which, now that I have it, I also add to my list for later.

    
    if(sessionsOK())
    {
        session_start();
        ...
    }
    
    function sessionsOK()
    {
        global $uaFile, $delimiter;
    
        if(isset($_COOKIE["PHPSESSID"]))
            return 1;
    
        $uaList = file_get_contents($uaFile);
        $toMatch = $_SERVER["HTTP_USER_AGENT"] . $delimiter;
        $matchPos = strpos($uaList, $toMatch);
        if($matchPos === false)
        {
            # Unknown agent: fall back to the expensive get_browser()
            # call and cache the answer in the file for next time.
            $br = get_browser();
            $isCrawler = (int) $br->crawler;        # 1 or 0
            $fp = fopen($uaFile, "a");
            fwrite($fp, $toMatch . $isCrawler . "\n");
            fclose($fp);
        }
        else
        {
            # Known agent: read the cached 1/0 flag stored after the delimiter.
            $isCrawler = (int) substr($uaList, $matchPos + strlen($toMatch), 1);
        }
        
        return !$isCrawler;
    }
    
    Code (markup):
    enjoy. ;)
     
    wolfpack, Mar 21, 2005 IP