Hello, I am writing to harvest the content of the Apple user forum for a research project. I am using fopen() to open a series of web pages. This works for the first couple of pages and then stops returning content for the remaining ones. Here is a detailed description.

The URL for the forum page listing the threads is http://discussions.apple.com/forum.jspa?forumID=1334&start=0. The last number (0) is incremented by 15 to move to the next page of threads (http://discussions.apple.com/forum.jspa?forumID=1334&start=15), and so on. I am trying to open the pages one at a time, read the content, and then locate the information I need. My partial code looks like this:

```php
$counter = 0;
while ($counter < 1000) { // 1000 is just an arbitrary number
    echo $subforum_url = "http://discussions.apple.com/forum.jspa?forumID=1334&start=" . $counter;
    $posthandle = fopen($subforum_url, "r");
    $i = 0;
    $postcontents = '';
    if ($posthandle) {
        while (!feof($posthandle)) {
            $postcontents .= fgets($posthandle, 8192);
            echo $postcontents;
            echo "<br />";
            $i++;
        }
    }
    $counter += 15;
}
```

This is followed by code that uses preg_match to extract the title of each thread, the date it was posted, and so on. My problem is that everything works very well for the first couple of pages: I can read the content, and all the preg_match calls succeed. After that, all my code manages to do is echo the URL; it doesn't seem to open any content. Do you know what's going on? Any workaround? Thank you so much.
You're not closing the handle after it is opened, so it's possible you're running out of memory. Use fclose(). Or you can try it this way:

```php
<?php
$counter = 0;
$page_count = 1;

while ($counter < 1000) { // 1000 is just an arbitrary number
    $subforum_url = "http://discussions.apple.com/forum.jspa?forumID=1334&start=";
    $open_url = $subforum_url . $counter;

    // file_get_contents() opens, reads, and closes the URL in one call,
    // so there is no handle left dangling
    $url_content = file_get_contents($open_url);

    echo 'Reading page ' . $page_count . '<br/>';

    $counter += 15;
    $page_count++;
}
?>
```

Michael
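If the pages still come back empty, it is also worth checking the return value, since file_get_contents() returns false on failure. A minimal sketch of that check (the echo messages are just for illustration):

```php
<?php
$counter = 0;

while ($counter < 1000) {
    $open_url = "http://discussions.apple.com/forum.jspa?forumID=1334&start=" . $counter;

    // @ suppresses the PHP warning; a strict false means the request failed
    $url_content = @file_get_contents($open_url);

    if ($url_content === false) {
        echo 'Request failed for ' . $open_url . '<br/>';
    } else {
        echo 'Fetched ' . strlen($url_content) . ' bytes from ' . $open_url . '<br/>';
    }

    $counter += 15;
}
?>
```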
Thanks for your prompt feedback. I tried your code and it is still erratic. It works for the first few URLs, then skips a few, then works again! This is frustrating.
Try this out:

```php
<?php
/**
 * cURL Loop
 *
 * A cURL retrieval method for forum pages that avoids any need for fopen() or the other f* commands.
 *
 * @package   cURL Loop
 * @author    Joel Larson
 * @copyright Free to use.
 * @link      http://thejoellarson.com/#
 * @date      17.01.10
 */

# Set the initial counter.
$counter = 0;

# Posts per page.
$per_page = 15;

# While counter is less than the total...
while ($counter <= 1000) {
    # Set up the forum URL.
    $subforum_url = 'http://discussions.apple.com/forum.jspa?forumID=1334&start=' . $counter;

    # Initiate the cURL handle.
    $ch = curl_init();

    # Set the cURL options.
    curl_setopt($ch, CURLOPT_URL, $subforum_url); // URL to retrieve.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13'); // Spoof the user agent ;D
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the data as a string.

    # Retrieve the data into a string.
    $postcontents = curl_exec($ch);

    # Close the current cURL handle (frees memory).
    curl_close($ch);

    # Echo the post contents with your break.
    echo $postcontents . "\n<br />\n";

    # Free up more memory.
    unset($postcontents);

    # Increment the counter by the number of posts per page.
    $counter += $per_page;
}
```

There may be bugs, I haven't tested it. It should work, though.
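A small variant on the same idea, as an untested sketch: since the options never change between pages, a single cURL handle can be created once and reused for every request, and checking curl_exec()'s return value against false shows which page failed and why:

```php
<?php
// Untested sketch: reuse one cURL handle across all pages.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

for ($counter = 0; $counter <= 1000; $counter += 15) {
    // Only the URL changes between requests.
    curl_setopt($ch, CURLOPT_URL, 'http://discussions.apple.com/forum.jspa?forumID=1334&start=' . $counter);
    $postcontents = curl_exec($ch);

    if ($postcontents === false) {
        // curl_error() reports why the request failed (timeout, connection refused, etc.).
        echo 'Page at offset ' . $counter . ' failed: ' . curl_error($ch) . "<br />\n";
        continue;
    }

    echo $postcontents . "\n<br />\n";
}

curl_close($ch);
```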
It's more likely that your requests are being blocked by the remote server after so many requests. Try var_dump()ing the result of fopen() on every iteration and you should see the error.
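A rough sketch of that diagnosis, with a sleep() between requests in case the server is rate-limiting (the 2-second delay is an arbitrary guess at a polite crawl rate):

```php
<?php
// Rough sketch: dump fopen() failures and pause between requests.
for ($counter = 0; $counter < 1000; $counter += 15) {
    $subforum_url = 'http://discussions.apple.com/forum.jspa?forumID=1334&start=' . $counter;

    $posthandle = @fopen($subforum_url, 'r');
    if ($posthandle === false) {
        // Shows exactly which offset starts failing.
        var_dump($subforum_url);
        echo 'fopen failed here<br />';
    } else {
        $postcontents = stream_get_contents($posthandle);
        fclose($posthandle); // Always close the handle.
        echo 'Got ' . strlen($postcontents) . " bytes<br />\n";
    }

    sleep(2); // Throttle so the server is less likely to block us.
}
```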