Parse large XML files!

Discussion in 'PHP' started by Awilum, May 20, 2011.

  1. #1
    I need to parse large XML files ranging in size from about 500 MB to about 1,700 MB.

    I use XMLReader:

    	
        set_time_limit(0);

        $start_time = microtime(true);

        include_once 'inc/Misc.php';
        include_once 'inc/Database.php'; // presumably sets up $tess (not shown in this post)

        $files = array('xml/large_file.xml');

        foreach ($files as $file) {

            echo "\n";
            echo 'Filename: '.basename($file)."\n";
            echo 'Filesize: '.convert(filesize($file))."\n";
            echo 'Start parsing...'."\n";
            echo "\n";

            $reader = new XMLReader();
            $reader->open($file);

            while ($reader->read()) {
                switch ($reader->nodeType) {
                    case XMLReader::ELEMENT:
                        if ($reader->localName == "element-name") {
                            // Copy the current element into its own document
                            // and wrap it as a SimpleXML object.
                            $dom = new DOMDocument();
                            $n = $dom->importNode($reader->expand(), true);
                            $dom->appendChild($n);
                            $sxe = simplexml_import_dom($n);
                            $tess->file_big->insert($sxe);
                            echo "Insert done! "; benchmark();
                        }
                        break;
                }
            }

        }
    
    Everything works fine at first: the script parses the file and slowly inserts the data I need. But memory consumption grows steadily until the script runs out of resources.

    For example, I tried a 400 MB file; while parsing it the script used 2,000 MB of RAM, exhausted all available resources, and stopped.

    How should I deal with large files of about 500 MB to 1,700 MB?

    Would the XML Parser extension help? If so, how do I apply it to my problem?

    Are there any other options?
     
    Awilum, May 20, 2011
  2. ssmm987

    #2
    Loading big files in PHP, or in any other programming language, is not a smart thing to do. XML files of 500 to 1,700 MB are just too big. The best thing to do is to use JSON (although that is still big) or a MySQL database instead.

    If you desperately need a workaround, you could read the file piece by piece with fread, fgets, or fgetss, but that requires a complicated script to handle the file.
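
    As a rough illustration of the fread idea above, here is a minimal sketch of reading a file in fixed-size chunks. The function name countBytesInChunks and the 8192-byte default are made up for this example; it only counts bytes, whereas a real script would have to cope with XML elements that straddle chunk boundaries, which is the complicated part.

```php
<?php
// Hypothetical helper (not from the thread): streams a file in fixed-size
// chunks and returns the total number of bytes read, so that only one
// chunk is held in memory at a time.
function countBytesInChunks($path, $chunkSize = 8192) {
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $total = 0;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop instead of looping forever
        }
        // A real script would process $chunk here.
        $total += strlen($chunk);
    }

    fclose($handle);
    return $total;
}
```

    Calling countBytesInChunks('xml/large_file.xml') should return the same number as filesize() for that file, while memory use stays close to one chunk.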
     
    ssmm987, May 20, 2011
  3. Aotearoa

    #3
    There are two main approaches to using XML input. The first is to parse the whole XML document into an in-memory data structure, as DOM or PHP's SimpleXML do. This usually makes the surrounding code much easier to write, but it uses a lot of memory, typically a small integer multiple of the input document size.
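
    To make the in-memory approach concrete, here is a small sketch using SimpleXML. The sample document is invented for illustration (the case-file element name is borrowed from later in this thread); the point is only that the whole document is parsed before any of it can be used.

```php
<?php
// A tiny stand-in document (made up for this example). A real file would be
// loaded the same way, which is why memory use grows with document size.
$xml = '<cases>'
     . '<case-file><id>1</id></case-file>'
     . '<case-file><id>2</id></case-file>'
     . '</cases>';

// simplexml_load_string() parses the entire document before returning.
$doc = simplexml_load_string($xml);

// Once loaded, walking the tree is short and convenient.
$ids = array();
foreach ($doc->{'case-file'} as $caseFile) {
    $ids[] = (string) $caseFile->id;
}

echo implode(',', $ids), "\n"; // prints "1,2"
```

    The code around the parser stays short, but nothing can be processed until the whole input is in memory.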

    The other approach is to use a streaming parser. This means that as each element (and sometimes each attribute) is found, the parser calls one of your routines to handle it. You end up writing a lot more code, but you can process fairly large files efficiently. James Clark's Expat parser is a well-known example of this approach and is available in PHP as the "XML Parser" extension. I've not used this PHP extension, but I have used Expat directly from C/C++ and know it can easily handle files of 100 MB and more. To deal with large files, you need a read loop that passes the source file to the parser in chunks.
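
    A minimal sketch of such a read loop with PHP's XML Parser extension might look like the following. The function name countElements and the 64 KB chunk size are assumptions made for this example, and it only counts matching start tags; a real handler would collect the data it needs. Closures require PHP 5.3 or later.

```php
<?php
// Hypothetical helper (not from the thread): counts elements with a given
// name using the event-based XML Parser (Expat) extension, feeding the file
// to the parser in chunks.
function countElements($path, $elementName, $chunkSize = 65536) {
    $count = 0;

    $parser = xml_parser_create();
    // Report element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // Called for every start tag.
        function ($parser, $name, $attrs) use (&$count, $elementName) {
            if ($name === $elementName) {
                $count++;
            }
        },
        // Called for every end tag; ignored in this sketch.
        function ($parser, $name) {
        }
    );

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            fclose($handle);
            throw new RuntimeException("Read error on $path");
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $chunk, feof($handle))) {
            $error = xml_error_string(xml_get_error_code($parser));
            fclose($handle);
            throw new RuntimeException("XML error: $error");
        }
    }

    fclose($handle);
    return $count; // the parser is released when it goes out of scope
}
```

    Because the parser keeps its own state between calls to xml_parse(), a tag that is split across two chunks is still handled correctly.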

    Of course it is still going to take a long time to run.

    As others have mentioned, you'll probably find it better to break your XML into smaller chunks and process these individually or work from a database that you've pre-loaded your source document(s) into.
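
    One way to get such smaller chunks without splitting the file by hand is sketched below: XMLReader walks the file and hands each record to a callback as its own small XML string. The function name eachRecord is hypothetical, and the sketch assumes the records are sibling elements with the same name. It is not a tested fix for the memory growth reported in this thread.

```php
<?php
// Hypothetical helper (not from the thread): passes each record element of a
// large XML file to $callback as a separate XML string, one at a time.
function eachRecord($path, $recordName, $callback) {
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    // Move forward to the first record element.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == $recordName) {
            $found = true;
            break;
        }
    }

    $count = 0;
    while ($found) {
        // readOuterXml() returns just this element and its children as text.
        call_user_func($callback, $reader->readOuterXml());
        $count++;
        // next() skips the current subtree and moves to the next node
        // with the given name, or returns false when there is none.
        $found = $reader->next($recordName);
    }

    $reader->close();
    return $count;
}
```

    Each string passed to the callback is a complete small document, so it can be loaded with simplexml_load_string() or written to its own file and processed individually.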

    HTH

    Bruce
     
    Aotearoa, May 20, 2011
  4. Awilum

    #4
    I rewrote the script. It now frees memory wherever I can, and I use MongoDB for data storage.
    The general problem of running out of memory remains, but I don't think this should happen with XMLReader.

    xmlreader.php
    
    <?php

        $start_time = microtime(true);

        include 'inc/Misc.php';

        logAdd('Script start');

        set_time_limit(0);

        // Try to enable garbage collection on 5.3+
        if (function_exists('gc_enable') && !gc_enabled()) {
            gc_enable();
        }

        //$mongo = new Mongo();
        $db = 'tess';
        $collection = 'apc';

        $files = array(/*'xml/apc101231-42.xml',*/
                       /*'xml/apc110101.xml'*/ 'xml/apc110101.xml');

        foreach ($files as $file) {
            echo "\n";
            echo 'Filename: '.basename($file)."\n";
            echo 'Start parsing...'."\n";
            echo "\n";

            $reader = new XMLReader();
            $reader->open($file);

            logAdd('Start parsing');

            while ($reader->read()) {
                switch ($reader->nodeType) {
                    case XMLReader::ELEMENT:
                        if ($reader->localName == "case-file") {

                            logAdd('case-file found');

                            $dom = new DOMDocument();
                            $n = $dom->importNode($reader->expand(), true);
                            $dom->appendChild($n);
                            $sxe = simplexml_import_dom($n);

                            logAdd('case-file in $sxe');

                            // Insert data!
                            //$mongo->$db->$collection->insert($sxe);

                            logAdd('Insert done!');

                            //print_r($sxe);
                            echo "Insert done! \n";

                            // Now clear the memory.
                            unset($n, $dom, $sxe);

                            logAdd('Clear the memory');
                        }
                        break;
                }
                // Note: this runs after every node read, not only for case-file elements.
                logAdd('case-file in $sxe');
            }

            // Close the resource
            $reader->close();

            // Delete the object to free memory
            unset($reader);

            logAdd('Stop parsing');
        }

        // Disabled together with the Mongo connection above; calling close()
        // on the undefined $mongo would cause a fatal error.
        //$mongo->close();
    

    inc/Misc.php
    
    <?php
    	
    	
         /**
          * Convert bytes in 'kb','mb','gb','tb','pb'
          * @param integer $size Data to convert
          * @return string
          */
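        function convert($size) {
            $unit = array('b','kb','mb','gb','tb','pb');
            // Guard against zero or empty sizes, for which log() gives no usable index.
            if ($size <= 0) {
                return '0 b';
            }
            $i = (int) floor(log($size, 1024));
            return round($size / pow(1024, $i), 2).' '.$unit[$i];
        }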
    	
    	
        /**
         * Get memory usage
         * @param boolean $render Displays the result of the function in the browser or not
         */
        function getMemoryUsage($render=true) {
            if (function_exists('memory_get_usage')) {
                $memory_usage = memory_get_usage();
            } else if (substr(PHP_OS,0,3) == 'WIN') {
                // Windows 2000 workaround
                $output = array();
                exec('pslist ' . getmypid() , $output);
                $memory_usage = trim(substr($output[8],38,10));
            } else {
                $memory_usage = '';
            }
            if($render) {
                echo 'Memory usage: '.convert($memory_usage);
            } else {
                return $memory_usage;
            }
        }
        
        
        /**
         * Get elapsed time
         * @global integer $start_time Start time value
         * @param boolean $render Displays the result of the function in the browser or not
         */
        function getElapsedTime($render=true) {
            global $start_time;
            $result_time = microtime(true) - $start_time;
            if ($render) {
                printf("Elapsed time %.3f seconds", $result_time);
            } else {
                return sprintf("%.3f", $result_time);
            }
        }
    
    
        /**
         * Benchmark
         */
        function benchmark() {          
            getMemoryUsage(); echo " - "; getElapsedTime(); echo "\n";
        }
    
    
    	/**
    	 * Log add
    	 */	 
    	function logAdd($message) {		
    		file_put_contents('log.txt',$message.' - '.convert(getMemoryUsage(false))." - ".getElapsedTime(false)."\n", FILE_APPEND);
    	}
    	
    ?>
    

    Log
    I do not see any problems in the log, but my script's process still leaks memory.
     
    Awilum, May 24, 2011