The original code is $pageContent = $this->getContents($this->startURL); where the URL is hard-coded in the script. I tried $pageContent = $this->getContents($_GET['url']); so the script picks the URL up from the query string instead, like domain.com/implementation.php?url=http://www.domain.com. It gets the page, but it gets it over and over and over and....
I'm assuming you're using a custom class/function? PHP already has a function for this; simply do: $pageContent = file_get_contents($_GET['url']); But you should really validate $_GET['url'] before retrieving its content. I don't see why it would fetch the content over and over; it should only fetch it each time you visit domain.com/implementation.php?url=http://www.domain.com
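For example, a rough sketch of what that validation could look like (the http/https whitelist is just my suggestion, adjust to taste):

$url = isset($_GET['url']) ? $_GET['url'] : '';

// Reject anything that is not a syntactically valid URL
if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    die('Invalid URL');
}

// Only allow web schemes, not file://, php://, etc.
$scheme = parse_url($url, PHP_URL_SCHEME);
if (!in_array($scheme, array('http', 'https'))) {
    die('Only http/https URLs are allowed');
}

$pageContent = file_get_contents($url);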
Is $url = $_GET['url']; what validates it? I've had that at the top of the script. $pageContent = file_get_contents($_GET['url']); also fetches it over and over and over. The code is about halfway down.

<?php
$url = $_GET['url'];

$DB_USER = 'root';
$DB_PASSWORD = 'XXXXXXXXX';
$DB_HOST = 'localhost';
$DB_NAME = 'database';

$dbc = mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
mysql_select_db($DB_NAME) or $error = mysql_error();

$new = new scraper;
// Start Path can be empty, which will be extracted from the start URL
$new->setStartPath();
//$new->setStartPath('http://www.domain.com');
//$new->startURL('http://www.domain.com/file.shtml');
$new->startScraping();

class scraper {
    // URL that stores first URL to start
    var $startURL;
    // List of allowed page extensions
    var $allowedExtensions = array('.css','.xml','.rss','.ico','.js','.gif','.jpg','.jpeg','.png','.bmp','.wmv','.avi','.mp3','.flash','.swf','.css');
    // Which URL to scrape
    var $useURL;
    // Start path, for links that are relative
    var $startPath;

    // Set start path
    function setStartPath($path = NULL){
        if($path != NULL) {
            $this->startPath = $path;
        } else {
            $temp = explode('/', $this->startURL);
            $this->startPath = $temp[0].'//'.$temp[2];
        }
    }

    // Add the start URL
    function startURL($theURL){
        // Set start URL
        $this->startURL = $theURL;
    }

    // Function to get URL contents
    function getContents($url) {
        $ch = curl_init();        // initialize curl handle
        $buffer = curl_exec($ch); // run the whole process
        curl_close($ch);
        return $buffer;
    }

    // Actually do the URLs
    function startScraping() {
        // Get page content
        // Original
        $pageContent = $this->getContents($this->startURL);
        // This does it...forever.
        // $pageContent = $this->getContents($_GET['url']);

        preg_match_all('/href="([^"]+)"/Umis', $pageContent, $results);

        // Add to the email list array
        $insertCount = 0;
        foreach($results[1] as $curEmail) {
            if($insert){ $insertCount++; }
            echo <<< END
<A HREF="mailto:$curEmail">E-Mail Me</a><BR>
END;
        }
        // echo 'Emails found: '.number_format($insertCount).PHP_EOL;

        // Mark the page done
        $insert = mysql_query("INSERT INTO `finishedurls` (`urlname`) VALUES ('".$this->startURL."')");

        // Get list of new page URLs if emails were found on previous page
        preg_match_all('/href="([^"]+)"/Umis', $pageContent, $results);
        $currentList = $this->cleanListURLs($results[1]);

        $insertURLCount = 0;
        // Add the list to the array
        foreach($currentList as $curURL) {
            $insert = mysql_query("INSERT INTO `workingurls` (`urlname`) VALUES ('$curURL')");
            if($insert){ $insertURLCount++; }
        }

        $getURL = mysql_fetch_assoc(mysql_query("SELECT `urlname` FROM `workingurls` ORDER BY RAND() LIMIT 1"));
        $remove = mysql_query("DELETE FROM `workingurls` WHERE `urlname`='$getURL[urlname]' LIMIT 1");

        // Get the new page ready
        $this->startURL = $getURL['urlname'];
        $this->setStartPath();

        // If no more pages, return
        if($this->startURL == NULL){ return; }

        // Clean vars
        unset($results, $pageContent);

        // If more pages, loop again
        $this->startScraping();
    }

    // Function to clean input URLs
    function cleanListURLs($linkList) {
        foreach($linkList as $sub => $url) {
            // Check if only 1 character - there must exist at least / character
            if(strlen($url) <= 1){ unset($linkList[$sub]); }
            // Check for any javascript
            if(eregi('javascript', $url)){ unset($linkList[$sub]); }
            // Check for invalid extensions
            //str_replace($this->allowedExtensions,'',$url,$count);
            if($count > 0){ unset($linkList[$sub]); }
            // If URL starts with #, ignore
            if(substr($url,0,1) == '#'){ unset($linkList[$sub]); }
            // If everything is OK and path is relative, add starting path
            if(substr($url,0,1) == '/' || substr($url,0,1) == '?' || substr($url,0,1) == '='){
                $linkList[$sub] = $this->startPath.$url;
            }
        }
        $remove = mysql_query("DELETE FROM `finishedurls`");
        $optimize = mysql_query("OPTIMIZE TABLE `emaillist` , `finishedurls` , `workingurls`");
        $optimize = mysql_query("DELETE TABLE `emaillist` , `finishedurls` , `workingurls`");
        return $linkList;
    }
}
?>
I think the problem is in this line of code: $this->startURL = $getURL['urnlname']; The query just above it is returning the same URL that was already scraped. I can't tell exactly, because your code came through as one single line when I copied it into Notepad from here (very difficult to read). Thanks
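Untested, but one way to stop it handing back a page it has already done might be to skip anything already recorded in `finishedurls` when picking the next URL, something like:

// Sketch only: pick a random working URL that has not already been finished
$getURL = mysql_fetch_assoc(mysql_query(
    "SELECT `urlname` FROM `workingurls`
     WHERE `urlname` NOT IN (SELECT `urlname` FROM `finishedurls`)
     ORDER BY RAND() LIMIT 1"
));

For that to help, you'd also have to stop emptying `finishedurls` inside cleanListURLs(), since the DELETE FROM `finishedurls` there wipes the history on every pass.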
Actually, I can't see where getContents() actually uses $url. It lacks something like: curl_setopt($ch, CURLOPT_URL, $url); right after this line: $ch = curl_init(); // initialize curl handle Hope that helps!
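While you're in there, you'll probably also want CURLOPT_RETURNTRANSFER, otherwise curl_exec() just prints the page and returns TRUE instead of handing you a string. Something like this (untested):

// Function to get URL contents
function getContents($url) {
    $ch = curl_init();                           // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);         // tell cURL which URL to fetch
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
    $buffer = curl_exec($ch);                    // run the whole process
    curl_close($ch);
    return $buffer;
}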
Turns out $new->setStartPath('http://www.domain.com/'); just needed $url in it instead! $new->setStartPath($url);
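For anyone landing here later, the top of the script presumably ends up something like this (just a sketch based on the fix above; the startURL() call is my assumption, carried over from the commented-out line in the original post):

$url = $_GET['url'];       // URL passed in as ?url=http://www.domain.com
$new = new scraper;
$new->startURL($url);      // assumption: seed the first page with the same URL
$new->setStartPath($url);  // the actual fix: pass $url instead of a hard-coded domain
$new->startScraping();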