
What am I running out of?

Discussion in 'PHP' started by Owlcroft, Aug 26, 2004.

  1. #1
    A PHP script needs to process a very large number of external web pages, say 10,000.

    It fetches a page, extracts some data, appends that data to the end of a cumulative data file, waits until a full second has passed since the last HTTP fetch, then starts in on the next page. It fastidiously closes every handle it opens, per page.

    The function set_time_limit(0) has been called near the top of the script (and safe mode is Off).
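    The loop described above, as a minimal sketch (the fetch and extract helpers here are placeholders, not the original script's code; the real version would pull each URL over HTTP):

```php
<?php
set_time_limit(0);  // lift PHP's execution-time cap (safe mode is off)

// Placeholder fetch: the real script retrieves each external page over HTTP.
function fetch_page($url) {
    return "<title>Example for $url</title>";
}

// Placeholder extraction: pull whatever data the real script wants.
function extract_data($html) {
    preg_match('/<title>(.*)<\/title>/', $html, $m);
    return isset($m[1]) ? $m[1] : '';
}

$urls = array('http://example.com/a', 'http://example.com/b');
$last_fetch = 0;

foreach ($urls as $url) {
    // Wait until at least one second has passed since the last fetch.
    while (time() - $last_fetch < 1) {
        sleep(1);
    }
    $last_fetch = time();

    $data = extract_data(fetch_page($url));

    // Append the extracted data to the cumulative file, closing the
    // handle for every page, as the original script does.
    $fh = fopen('results.txt', 'a');
    fwrite($fh, $data . "\n");
    fclose($fh);
}
```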
    The script bangs merrily along until some number of pages--never exactly the same, but typically close to 5,000--then just dies.

    What resource am I running out of?

    I am at a loss here. I don't see how it can be memory, as the script doesn't use much itself and the array into which the pages are read is recycled. I don't see how it could be time, as the execute-time limitation is--supposedly--removed. I don't see how it could be handles, as those are closed. I don't see how it could be buffer space, as every echo statement is followed by both an ob_flush() and a flush() (and in any event I have not turned output buffering on). So what is happening?
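    One way to narrow this down is to log memory consumption every few hundred pages; if the figure climbs steadily even though the page array is recycled, memory is the culprit. A sketch (note: on PHP 4.x, memory_get_usage() is only available when PHP is built with --enable-memory-limit):

```php
<?php
// Inside the main loop, report memory use periodically so a leak
// shows up in the output long before the script dies.
for ($i = 1; $i <= 1000; $i++) {
    // ... fetch and process one page here ...

    if ($i % 200 == 0) {
        echo "pages=$i memory=" . memory_get_usage() . " bytes\n";
    }
}
```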

    Any ideas?
     
    Owlcroft, Aug 26, 2004 IP
  2. digitalpoint

    digitalpoint Overlord of no one Staff

    Messages:
    38,283
    Likes Received:
    2,599
    Best Answers:
    460
    Trophy Points:
    710
    Digital Goods:
    29
    #2
    Have you tried running it from a shell, or just via an HTTP request?
     
    digitalpoint, Aug 26, 2004 IP
  3. Owlcroft

    Owlcroft Peon

    Messages:
    645
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #3
    > Have you tried running it from a shell, or just via an HTTP request?

    Just by HTTP request.

    I've tried that both direct (using a browser) and indirect (using another php script to call it); no noticeable difference.

    From earlier, similar problem studies, I believe--though I haven't checked in this exact case, I'd still bet on it--that the fails always come during a page-fetch. I started using a simple file() command, then descended to writing a detailed function using socket-level calls. That didn't help (though, to my surprise, it was substantially faster than file(), so I continue to use it). It issues the fgets and never returns--no FALSE, no nothing, just script death.

    On the other hand, that's where the script spends the vast majority of its clock time--as opposed to cpu time--so it is overwhelmingly likely that if the fail is some sort of timeout, that's where it would occur.
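    One hedged guess about fgets() never returning: a socket read with no timeout will block indefinitely if the remote server stops responding. A sketch of a fetch with explicit connect and read timeouts, using the PHP 4-era function names (the host here is illustrative):

```php
<?php
// Connect with a 5-second timeout, then set a read timeout so
// fgets() gives up instead of blocking forever.
$fp = @fsockopen('www.example.com', 80, $errno, $errstr, 5);
if ($fp) {
    socket_set_timeout($fp, 15);   // 15-second read timeout
    fwrite($fp, "GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n");
    while (!feof($fp)) {
        $line = fgets($fp, 4096);
        if ($line === false) {
            break;                       // read failed
        }
        $meta = socket_get_status($fp);  // detect a timeout explicitly
        if ($meta['timed_out']) {
            break;
        }
    }
    fclose($fp);
}
```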

    I have seen suggestions that even with PHP's set_time_limit() used, there is a higher-up value, in Apache's httpd or some such place, that is not overridden. Could that be my problem? The phpinfo() dump doesn't show anything that I can recognize, but if it's over PHP's head, I reckon it wouldn't.
     
    Owlcroft, Aug 26, 2004 IP
  4. digitalpoint

    #4
    If it's an option, try running it from the shell. I converted everything that takes more than 10 seconds to be executed from the shell (for example the "check all" process in the keyword tracker), and it all runs much more reliably. You can even trigger the shell version from a web request if you want to get tricky (spin it off into its own thread).

    But I would at least try to manually run it from the shell and see if that works better just as an initial test.
     
    digitalpoint, Aug 26, 2004 IP
  5. Owlcroft

    #5
    I am, to be honest, not sure what that would mean here. I am on a shared server, and have no way that I currently know of to access the shell, save perhaps through a cron command.

    But I feel like I'm not understanding fully what you mean. Could I impose on you for some education here?
     
    Owlcroft, Aug 28, 2004 IP
  6. digitalpoint

    #6
    You can trigger stuff from the shell via PHP if you want. The example I mentioned, where the keyword tracker runs the "check all" processes for users as a PHP script from a shell (in its own process), is done like so:

    exec("/usr/bin/php lookupall.php $query_type $user_id >/dev/null &");

    Basically it just triggers a command, sends anything sent back to /dev/null (nowhere), and the & is the key thing for what I wanted to do... put it into its own background process.

    With my lookupall.php shell script taking two command-line parameters...
     
    digitalpoint, Aug 28, 2004 IP
  7. Owlcroft

    #7
    In--

    exec("/usr/bin/php lookupall.php $query_type $user_id >/dev/null &");

    --I deduce that the shell passes $query_type and $user_id as parameters to the called script lookupall.php.

    I am, though, less clear on the >/dev/null & elements. I would think that the first part, >/dev/null, simply sends any generated (echo'ed) output to the bit bucket. I am less clear on the remaining ampersand. I am not a *nix person--I would have expected that part of the command tail to affect standard error, but I gather not. You refer to it as putting the script into its own background process--I take it (with care) that that is a *nix convention?

    Also: does the script need to be modified to account for the way the parameters are being passed to it (if it was intended to get them as part of the $_GET[] array)?

    Sorry to be dense.
     
    Owlcroft, Aug 28, 2004 IP
  8. digitalpoint

    #8
    Yep, you are right about everything... the variables are for my script, and the & is a Unix convention.

    There is a difference in how the variables are read from the shell... they go into the $argv[] array.
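    For reference, a minimal sketch of how a script like lookupall.php would read those shell arguments ($argv[0] is the script name itself; the variable names are illustrative):

```php
<?php
// When PHP runs from the command line, parameters arrive in $argv
// rather than in $_GET.
$query_type = isset($argv[1]) ? $argv[1] : '';
$user_id    = isset($argv[2]) ? (int) $argv[2] : 0;

echo "query_type=$query_type user_id=$user_id\n";
```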
     
    digitalpoint, Aug 28, 2004 IP
  9. Owlcroft

    #9
    Thank you! I will see what happens the first time I come up for air (and don't pass the time reading forums).
     
    Owlcroft, Aug 29, 2004 IP
  10. Owlcroft

    #10
    Well, it turns out, according to my host, that what I'm running out of is memory, which is--to put it mildly--bloody-all strange.

    But they just upgraded to PHP 4.3.8 a week ago, about when my woes began, and lo, an inspection of the PHP bug logs reveals complaints about 4.3.8 having mysterious memory-growth problems.

    Of course, they email me back that it's *my* problem, apparently ignoring my reference to the PHP bug report. I have responded sufficiently, and we will see what we will see.
     
    Owlcroft, Aug 31, 2004 IP
  11. Owlcroft

    #11
    I guess I'm too successful. It seems, on examination by them and me of various logs, that it's simply a case of a lot of hits on pages made by php. The assailants are, of course, the bots, so it's a good-news/bad-news thing: the good news is that we're indexing your 600,000-page site; the bad news is that we're indexing your 600,000-page site--at bot speeds.

    I have opted to open a second account, which will be placed on a different (and, one hopes, lower-load) shared server. I am just not making the money, at this point, to justify a dedicated-server account.

    If it turns out that this doesn't answer the need, I'll be mightily steamed.

    I just wish this "success" would translate into some income . . . .
     
    Owlcroft, Sep 1, 2004 IP