fopen works for a couple of web pages and then fails

Discussion in 'PHP' started by thescintist, Jan 16, 2010.

  1. #1
    Hello,

    I am writing a script to harvest the content of the Apple user forum for research I am doing.
    I am using fopen to open a series of web pages. This works for the first couple of pages and then stops returning content for the remaining ones. Below is a detailed description.

    The url for the forum with the listing of the threads is: http://discussions.apple.com/forum.jspa?forumID=1334&start=0. The last number (i.e. 0) is then incremented by 15 to move to the following page of threads (http://discussions.apple.com/forum.jspa?forumID=1334&start=15) etc...
    I am trying to open the pages one at a time, read the content and then locate the information I need.

    My partial code looks like this:
    $counter = 0;
    while ($counter < 1000) // 1000 is just an arbitrary number
    {
        echo $subforum_url = "http://discussions.apple.com/forum.jspa?forumID=1334&start=" . $counter;
        $posthandle = fopen($subforum_url, "r");
        $i = 0;
        $postcontents = '';
        if ($posthandle) {
            while (!feof($posthandle)) {
                $postcontents .= fgets($posthandle, 8192);
                echo $postcontents;
                echo "<br />";
                $i++;
            }
        }
        $counter += 15;
    }
    This is followed by code that uses preg_match to extract the title of each thread, the date it was posted, etc.

    My problem is that everything works very well for the first couple of pages: I can read the content, and all the preg_match code works.
    After that, all my code does is echo the URL; it doesn't seem to retrieve any content.

    Do you know what's going on?
    Any workaround?

    Thank you so much.
     
    thescintist, Jan 16, 2010 IP
  2. mbaldwin

    mbaldwin Active Member

    Messages:
    215
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    95
    #2
    You're not closing the handle after it is opened, so it's possible you're running out of memory. Use fclose.
    Or you can try it this way:
    
    <?php
    $counter = 0;
    $page_count = 1;
    while ($counter < 1000) // 1000 is just an arbitrary number
    {
        $subforum_url = "http://discussions.apple.com/forum.jspa?forumID=1334&start=";
        $open_url = $subforum_url . $counter;
        $url_content = file_get_contents($open_url);
        echo 'Reading page ' . $page_count . '<br/>';

        $counter += 15;
        $page_count++;
    }
    ?>
    
    Code (markup):
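If you'd rather keep the original fopen loop, a sketch with fclose() and a failure check added might look like the following. It's untested against the live forum, and build_url and harvest are helper names I've made up, not from the code above:

```php
<?php
// Hypothetical helper: build the page URL for a given thread offset.
function build_url($start) {
    return "http://discussions.apple.com/forum.jspa?forumID=1334&start=" . $start;
}

// Fetch every page up to $limit, closing each handle and reporting failures.
function harvest($limit) {
    for ($counter = 0; $counter < $limit; $counter += 15) {
        $url = build_url($counter);
        $posthandle = @fopen($url, "r"); // @ suppresses the warning; we check the result instead
        if ($posthandle === false) {
            echo "Failed to open $url<br />"; // surface the failure instead of skipping silently
            continue;
        }
        $postcontents = '';
        while (!feof($posthandle)) {
            $postcontents .= fgets($posthandle, 8192);
        }
        fclose($posthandle); // release the handle on every iteration
        echo $postcontents . "<br />";
    }
}

// harvest(1000);
```

Printing a message on failure instead of silently continuing would at least tell you which pages are being skipped.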

    Michael
     
    mbaldwin, Jan 16, 2010 IP
  3. thescintist

    thescintist Peon

    Messages:
    3
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Thanks for your prompt feedback.

    I tried your code and it is still erratic. It works for the first few URLs, then skips a few, then works again!

    This is frustrating.
     
    thescintist, Jan 17, 2010 IP
  4. mbaldwin

    mbaldwin Active Member

    Messages:
    215
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    95
    #4
    Have you tried a smaller limit, like 300, so it only fetches 20 pages instead of 67?
     
    mbaldwin, Jan 17, 2010 IP
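For what it's worth, the 67 and 20 above follow from the loop bounds: stepping by 15 while the counter stays below the limit makes ceil(limit / 15) requests. A quick sketch (page_count is my name for it, not from the thread):

```php
<?php
// Number of requests made by a loop that steps $per_page at a time
// while the counter stays below $limit.
function page_count($limit, $per_page = 15) {
    return (int) ceil($limit / $per_page);
}

echo page_count(1000); // 67
echo page_count(300);  // 20
```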
  5. thescintist

    thescintist Peon

    Messages:
    3
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #5
    I did try it with a smaller number, but I still observe the same result: it keeps skipping some URLs.
     
    thescintist, Jan 17, 2010 IP
  6. CodedCaffeine

    CodedCaffeine Peon

    Messages:
    130
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Try this out:

    <?php
    /**
     * cURL Loop
     *
     * A cURL retrieval method for forum pages that avoids fopen and the other f() commands.
     *
     * @package    cURL Loop
     * @author     Joel Larson
     * @copyright  Free to use.
     * @link       http://thejoellarson.com/#
     * @date       17.01.10
     */

    # Set the initial counter.
    $counter = 0;

    # Threads per page.
    $per_page = 15;

    # While the counter is less than the total...
    while ($counter <= 1000)
    {
        # Set up the forum URL.
        $subforum_url = 'http://discussions.apple.com/forum.jspa?forumID=1334&start=' . $counter;

        # Initiate the cURL handle.
        $ch = curl_init();

        # Set the cURL options.
        curl_setopt($ch, CURLOPT_URL, $subforum_url); // URL to retrieve.
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13'); // Spoof the user agent ;D
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the data as a string.

        # Retrieve the data into a string.
        $postcontents = curl_exec($ch);

        # Close the current cURL handle. (Free up memory)
        curl_close($ch);

        # Echo the post contents with your break.
        echo $postcontents . "\n<br />\n";

        # Free up more memory.
        unset($postcontents);

        # Increment the counter by the number of threads on a page.
        $counter += $per_page;
    }
    PHP:
    There may be bugs, I haven't tested it. It should work though :)
     
    CodedCaffeine, Jan 17, 2010 IP
  7. szalinski

    szalinski Peon

    Messages:
    341
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #7
    It's more likely that your requests are being blocked by the remote server after a certain number of requests. Try var_dump()ing the fopen result on every iteration and you should see the error.
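If that's the case, spacing out the requests and retrying failures may help. A sketch along those lines, using file_get_contents as in post #2 (fetch_with_retry, the 2-second delay, and the 3 attempts are all my own guesses, not from this thread):

```php
<?php
// Retry a fetch a few times, pausing between attempts so a rate-limiting
// server has a chance to let us back in.
function fetch_with_retry($url, $retries = 3, $delay = 2) {
    for ($attempt = 0; $attempt < $retries; $attempt++) {
        $body = @file_get_contents($url); // @ suppresses the warning; we test the result
        if ($body !== false) {
            return $body;
        }
        sleep($delay); // back off before the next attempt
    }
    return false; // give up after $retries attempts
}
```

Adding a short sleep() between successful requests as well (say, a second per page) would keep the overall request rate down.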
     
    szalinski, Jan 19, 2010 IP