how to go through a lot of data

Discussion in 'PHP' started by ferdousx, Sep 20, 2008.

  1. #1
    Hello guys.
    I am a PHP newbie and this is my first post to Digital Point. I hope I will get some expert help.
    Here is my problem. In my PHP page I am trying to modify some data and show it to the users. The problem is the data size: the data comes from a text file, one record per line. What the script needs to do is -

    1. read a line from the text file
    2. process it
    3. show a line saying line #n is processed
    4. go back to line 1 above to process the next data

    Now, the text file actually contains around 10,000 lines!!! So when I tried to process it, after handling some data (around 10 lines) it stops executing with a message that the maximum execution time of the script (60 sec) has been exceeded.

    Can anybody tell me how I can process this much data?

    Thanks in advance.
    -ferdousx
     
    ferdousx, Sep 20, 2008 IP
  2. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,999
    Best Answers:
    253
    Trophy Points:
    515
    #2
    Well, this is usually where moving that data into a form of SQL would be called for, though I have to wonder what level of data processing you are doing that would take a minute to go through a measly ten lines.

    10,000 'records' of data is bupkis - the problem is likely your storage method.

    I'd have to see a sample of your data and what you are actually doing to it though to even come close to making any recommendations.
     
    deathshadow, Sep 20, 2008 IP
  3. JAY6390

    JAY6390 Peon

    Messages:
    918
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Well, to stop the time from running out you can use set_time_limit(X), where X is the max number of seconds; for instance, set_time_limit(3600) will allow script execution for up to an hour. For reading the file line by line, use
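    something like the standard fgets loop from the PHP manual (the /tmp/inputfile.txt path is just a placeholder):

    <?php
    // open the file for reading
    $handle = @fopen("/tmp/inputfile.txt", "r");
    if ($handle) {
        // grab one line at a time until end of file
        while (!feof($handle)) {
            $buffer = fgets($handle, 4096);
            echo $buffer;
        }
        fclose($handle);
    }
    ?>
    PHP: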

    Simply do your processing on the $buffer instead of the echo $buffer;
     
    JAY6390, Sep 20, 2008 IP
  4. classic

    classic Peon

    Messages:
    96
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #4
    First, 10,000 lines is not a big deal; you just need to know how to access the file fast. Read up on the fseek PHP function to jump quickly to a given position and read from there.
    Also, you may want to call set_time_limit(0) so that PHP doesn't stop execution of the script, though 60 seconds is plenty of time. I have made a custom index for myself with 100k records, and accessing a record by ID/line number takes around 0.1 seconds.
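
    For example, a minimal sketch of that index idea (data.txt is a made-up file name) - record the byte offset of every line once, then fseek() straight to any line later:

    <?php
    // build the index once: byte offset of the start of each line
    $offsets = array();
    $fh = fopen("data.txt", "r");
    while (($pos = ftell($fh)) !== false && fgets($fh) !== false) {
        $offsets[] = $pos;
    }

    // later: jump straight to line $n instead of scanning from the top
    $n = 5000;
    fseek($fh, $offsets[$n]);
    $line = fgets($fh);
    fclose($fh);
    ?>
    PHP: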

    The other, better way is to explain to me exactly what you need; I will help you solve the problem and you will be amazed how fast it can be done ;)
     
    classic, Sep 20, 2008 IP
  5. ferdousx

    ferdousx Peon

    Messages:
    168
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Hi. I can't set a predefined time limit, as the number of input lines varies, and so does the processing time. And deathshadow, it's actually a lot of string processing based on the input, plus checking a lot of conditions.
    Let me give you a simple case of what I want. I have been playing with this simplified form of my original script for about an hour.

    z2.php
    =====
    <?php
    if (!isset($_GET["var"])) {
        $_GET["var"] = 0; // $_GET["var"] is the line number
    }

    $_GET["var"] += 1;
    // take line number $_GET["var"] and do a lot of processing here
    // echo $_GET["var"].' is processed';
    $gotourl = "z2.php?var=" . $_GET["var"];
    header("Location: $gotourl");
    ?>

    Why is this script not running? The browser automatically stops after some time because the page keeps redirecting to itself. What can make this script run?
    (Of course I will add some checking to see whether the end of the input has been reached, and stop then.)
     
    ferdousx, Sep 20, 2008 IP
  6. tradeout

    tradeout Peon

    Messages:
    92
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Sent you a pm.

    Is there any reason you can't move the data into SQL?
     
    tradeout, Sep 20, 2008 IP
  7. ferdousx

    ferdousx Peon

    Messages:
    168
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Sorry tradeout, I noticed the PM but thought it was about something else, so I planned to read it later.
    The input is given from a txt file, no exception. But then yes, it can be transferred to SQL if that helps.
    Won't it give the same problem (exceeding the time limit) if I read the lines one after another and insert them into the database?
     
    ferdousx, Sep 20, 2008 IP
  8. classic

    classic Peon

    Messages:
    96
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Well, the code you gave is a bit awkward; it seems that you can accomplish this without going through redirects. If you can explain in words exactly what you need, we can help you.

    To avoid the script execution time limit, add this to the beginning of the PHP script:
    
    <?php
    // set max execution time to 0 so there is no time limit
    ini_set("max_execution_time", 0);

    // then do your stuff
    $line = @$_GET['var'];
    ......
    
    
    PHP:
     
    classic, Sep 20, 2008 IP
  9. JAY6390

    JAY6390 Peon

    Messages:
    918
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    0
    #9
    From the look of that code, your script is an infinite loop. You are redirecting back to the same page again and again, with no condition being tested to see whether you should redirect or not. Can you explain a bit more clearly the data you have, and also the reason you are redirecting instead of using a loop inside the script?
     
    JAY6390, Sep 20, 2008 IP
  10. ferdousx

    ferdousx Peon

    Messages:
    168
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #10
    OK, here is my problem: the txt file includes around 10k usernames. By adding a username after the URL as a GET variable, I go to the profile of that user. Then I grab the whole page's code into a var and format it (by string operations) to extract the necessary info I want about the user.

    So, it's actually crawling 10k pages one after another and processing each page. It's not website work but data extraction work.

    I think this will give you an idea of why I face the timeout.
     
    ferdousx, Sep 20, 2008 IP
  11. JAY6390

    JAY6390 Peon

    Messages:
    918
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    0
    #11
    <?php
    set_time_limit(18000); //Script will run for up to 5 hours if necessary (this is way overblown but just in case ;))
    $handle = @fopen("/tmp/inputfile.txt", "r");
    if ($handle) {
        while (!feof($handle)) {
            $buffer = str_replace("\n", '', fgets($handle, 4096));
            $url = 'http://www.thesite.com/path/here/'.$buffer;
            $pagecontent = file_get_contents($url);
            //process page here
        }
        fclose($handle);
    }
    PHP:
    That should work. I still can't understand why you are looping back to the same page; it's very strange, and as you have seen, your browser generally won't allow it.
     
    JAY6390, Sep 20, 2008 IP
  12. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,999
    Best Answers:
    253
    Trophy Points:
    515
    #12
    So far, the code is gibberish and I'm not even certain what's trying to be accomplished - I'm just certain you're going about it all wrong.

    That endless horde of one line then refresh is a train wreck of code and is NOT how you should ever be attempting to do ANYTHING.

    Question, how big is the data file in total filesize? 10,000 lines should be well under a megabyte - It might be faster to just load the whole dataset in a single var, split it on CR, then process it all in one pass.
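
    A minimal sketch of that one-pass idea (input.txt is a made-up name):

    $data = explode("\n", file_get_contents('input.txt')); // slurp the whole file, split on newlines
    foreach ($data as $line) {
    	/* process $line here */
    }
    Code (markup):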

    In any case - handshakes alone are probably adding a full second to the processing of each 'line'. At bare minimum, try to process more than one line at a time BEFORE sending a header/refresh, since you are taking something that should take fractions of a second for the entire file and turning it into two and a half hours.

    A sample of the data set and what you are trying to do to it would be more helpful than the obviously broken method you are trying to use to read it in.
     
    deathshadow, Sep 20, 2008 IP
  13. classic

    classic Peon

    Messages:
    96
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #13
    Eh, so JAY6390 has nailed your problem, and the part that is causing the time issue is the file_get_contents($url) call, as it depends on site response time and your net speed.

    So the main problem for you is not reading 10k lines of usernames but getting the data + parsing it.
    If you parse it with regex it will kill your machine.
    I recommend using an HTML DOM parser in PHP (look at Google),
    then load each page into the parser and extract whatever you need from it. 10k visits to a particular site can take up to
    10000 * getting_data_and_parsing_data = XXXX
    let's say that 'getting_data_and_parsing_data' for one page is around 3 seconds;
    10000 * 3 = 30000 sec or 8+ hours
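
    For example, a rough sketch with PHP's built-in DOMDocument (the URL and the XPath query are made-up placeholders):

    <?php
    $html = file_get_contents('http://www.thesite.com/profile/someuser');

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides warnings from sloppy real-world HTML

    // pull out whatever you need, e.g. cells marked with a hypothetical class
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//td[@class="profile-field"]') as $node) {
        echo trim($node->textContent), "\n";
    }
    ?>
    PHP: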
    So this is a small bot/crawler you want to make, and you don't need any HTTP GET thing.
    You just need a script that does the job, and you call it from the command line (there is no script time limit when you run it from a command prompt) -
    of course you need PHP installed, which is the easy part.

    As PHP is not multi-threaded and you want to speed up the whole process, you can:
    1. split the 10k file into 10 files with 1000 users per file (see the sketch below)
    2. call crawlthesite.php 10 times but with different arguments, e.g.
    php crawlthesite.php file1.txt &
    php crawlthesite.php file2.txt & .... The & tells the command prompt to execute the script in background mode, so you don't need 10 open command prompts.
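
    A rough sketch of the splitting step (users.txt and the fileN.txt names are made up):

    <?php
    // read all usernames, skipping empty lines
    $lines = file("users.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    // write file1.txt ... file10.txt with roughly equal chunks
    $chunks = array_chunk($lines, (int)ceil(count($lines) / 10));
    foreach ($chunks as $i => $chunk) {
        file_put_contents("file" . ($i + 1) . ".txt", implode("\n", $chunk) . "\n");
    }
    ?>
    PHP: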

    To take the argument from the command prompt in PHP, do:

    $file = $argv[1];
    if (file_exists($file)) {
        //do your work
    }
    
    PHP:
     
    classic, Sep 20, 2008 IP
  14. JEET

    JEET Notable Member

    Messages:
    3,832
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #14


    Instead of the redirect approach above, you should make a function to get the line you need.

    <?php
    set_time_limit(0);

    function get_line($n) {
        // find line $n in the file and return it
        return $line;
    }

    $total_lines = 10000;
    for ($x = 0; $x < $total_lines; ++$x) {
        $line = get_line($x);
        // process it, echo or whatever
    }
    ?>

    This will run much faster I think...
    Not sure what processing you are doing, but I think that by changing some code there, it can also be made to work faster.
    regards :)
     
    JEET, Sep 20, 2008 IP
  15. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,999
    Best Answers:
    253
    Trophy Points:
    515
    #15
    You know, I'm looking at all the methods posted so far and not seeing much in the way of actual flow logic. While the OP's original method of working one line at a time is going to /FAIL/ on handshakes and overhead alone, it's entirely possible for the dataset to get too large to handle all at once too. (Though 10,000 is bupkis, and again without knowing the data or what's being done to it we are all guessing!)

    If indeed the data set processing takes too long, it might be prudent to break it into sets and use the refresh method - just those 'sets' should be more than one line. I would NOT use set_time_limit, since while that does prevent PHP from dumping the script prematurely, it does not affect the time Apache will allow before killing the session, and it disables something put in place for a good reason - preventing you from overloading the server. Instead, I'd use 'batch processing' to keep the number of records handled below the default 30 second limit.

    First, create a variable saying how many lines to process per 'batch'; start with 1000 and adjust from there. Then set up a handle to the file being processed, and grab the filesize into a variable so you are only calling the function once.

    We then check that the handle is actually pointing at a file - if it isn't, we should have an else at the bottom to notify the user of the problem.

    If it is (which should be the result of our first condition, as it's the more likely result), we get the file position and store it in a variable, avoiding calling a rather slow function more than once. Check to see if a GET value of 'offset' is set; if it is, fseek to that point.

    Then you do a while not at end of file and while numberToProcess>0 to call fgets to read in a line, process your data, then decrement numberToProcess.

    Next up we should check to see if we are at the end of the file - if we aren't (the more likely condition, so put it first), output full HTML feedback with a three second delay AND an HTML fallback anchor should the automatic refresh fail. Put the current file position (ftell) into a var, again to save calling a slow function more than once. I'd use a meta refresh for that instead of header() just to make it clearer - passing our 'offset' value... and I'd make that refresh take at LEAST three seconds to give the server time to do other stuff. A little bit of delay to prevent overloading the server is always a good idea.

    In the feedback I'd also tell them how many bytes have been processed (our ftell value) and how big the file is. Since you'd have both variables available a more complete version could give them a progress bar.

    Of course, if we're at the end of the file, we let the user know it was completed.

    Which would look a little something like this:
    $numberToProcess=1000;
    
    $hFile=@fopen($pathToFile,'r');
    
    if ($hFile) {
    
    	$fSize=filesize($pathToFile);
    
    	if (isset($_REQUEST['offset'])) {
    		fseek($hFile,$_REQUEST['offset']);
    	}
    	
    	while (!feof($hFile) && ($numberToProcess>0)) {
    		$line=fgets($hFile);
    		/* process your line data here */
    		$numberToProcess--;
    	}
    
    	if (!feof($hFile)) {
    		$pos=ftell($hFile);
    		
    		echo '
    <html><head>
    	<meta http-equiv="REFRESH" content="3;url=process.php?offset='.$pos.'">
    	<title>
    		Processing '.$pos.' bytes out of '.$fSize.'
    	</title>
    </head><body>
    	Processing '.$pos.' bytes out of '.$fSize.'<br />
    	Please wait, process will continue in three seconds. If it does not, 
    	please <a href="process.php?offset='.$pos.'">click here</a>.
    </body></html>
    		';
    		
    	} else {
    	
    		echo '
    <html><head>
    	<title>
    		Processing Complete! '.$fSize.' Bytes Processed.
    	</title>
    </head><body>
    		Processing Complete! '.$fSize.' Bytes Processed.
    </body></html>
    		';
    		
    	}
    	
    } else { /* not found */
    	echo '
    <html><head>
    	<title>
    		ERROR - File Not Found
    	</title>
    </head><body>
    	An error in processing has occurred - unable to find the file to parse.
    </body></html>
    	';
    }
    Code (markup):
    Though again, without seeing a sample of the dataset being processed, or the code you are using to process it, we're all making wild guesses here.

    To set the value of $numberToProcess, I would keep increasing the value until you can dial in where it will fail, then divide it by two. So if it fails at 5000 at a time, set it to 2500. Engineering 101 - figure out how much load can be handled, then allocate half of that as a safety margin. A lot better than letting a script dominate the server for an 'endless' period. I might not even wait for it to fail - a good rule of thumb is ten seconds. If it takes more than ten seconds for the user to receive feedback, they will likely assume something is wrong.

    You could even track time instead of a set number of items. After handling a line, check the time; if the difference from when you started is under ten seconds, keep going; if it's over, send the user the current position and the refresh.
     
    deathshadow, Sep 21, 2008 IP
  16. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,999
    Best Answers:
    253
    Trophy Points:
    515
    #16
    Actually, thinking on it more, I would DEFINITELY do it based on time. That way it scales to the processing time of each line item - some lines might take longer than others - and adjusts to the server, as you might move it from one server to another, or be developing on one server but deploying to another - saving the headache of re-tweaking the value.

    $hFile=@fopen($pathToFile,'r');
    
    if ($hFile) {
    
    	$fSize=filesize($pathToFile);
    
    	if (isset($_REQUEST['offset'])) {
    		fseek($hFile,$_REQUEST['offset']);
    	}
    	
    	$startTime=time();
    	
    	while (!feof($hFile) && ((time()-$startTime)<10)) {
    		$line=fgets($hFile);
    		/* process your line data here */
    	}
    
    	if (!feof($hFile)) {
    		$pos=ftell($hFile);
    		
    		echo '
    <html><head>
    	<meta http-equiv="REFRESH" content="3;url=process.php?offset='.$pos.'">
    	<title>
    		Processing '.$pos.' bytes out of '.$fSize.'
    	</title>
    </head><body>
    	Processing '.$pos.' bytes out of '.$fSize.'<br />
    	Please wait, process will continue in three seconds. If it does not, 
    	please <a href="process.php?offset='.$pos.'">click here</a>.
    </body></html>
    		';
    		
    	} else {
    	
    		echo '
    <html><head>
    	<title>
    		Processing Complete! '.$fSize.' Bytes Processed.
    	</title>
    </head><body>
    		Processing Complete! '.$fSize.' Bytes Processed.
    </body></html>
    		';
    		
    	}
    	
    } else { /* not found */
    	echo '
    <html><head>
    	<title>
    		File Not Found
    	</title>
    </head><body>
    	An error in processing has occurred - unable to find the file to parse.
    </body></html>
    	';
    }
    Code (markup):
    This way you know it will stop every ten seconds, give the server a three second breather (plus handshaking) and then continue.
     
    deathshadow, Sep 21, 2008 IP