The original code is $pageContent = $this->getContents($this->startURL); where the URL is hard-coded in the script. I tried $pageContent = $this->getContents($_GET['url']); so the script picks the URL up from the query string instead, like domain.com/implementation.php?url=http://www.domain.com. It gets the page, but it gets it over and over and over and....
I'm assuming you're using a custom class/function? PHP already has a function for this; simply do: $pageContent = file_get_contents($_GET['url']); But you should really validate $_GET['url'] before retrieving its content. I don't see why it would fetch the content over and over; it should only fetch it each time you visit domain.com/implementation.php?url=http://www.domain.com
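For example, a rough sketch of what that validation could look like (the http/https whitelist is just my suggestion, adjust to taste):

$url = isset($_GET['url']) ? $_GET['url'] : '';

// Reject anything that is not a syntactically valid URL
if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    die('Invalid URL');
}

// Only allow web schemes, not file://, php://, etc.
$scheme = parse_url($url, PHP_URL_SCHEME);
if (!in_array($scheme, array('http', 'https'))) {
    die('Only http/https URLs are allowed');
}

$pageContent = file_get_contents($url);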
Is $url = $_GET['url']; what validates it? I've had that at the top of the script. $pageContent = file_get_contents($_GET['url']); also fetches it over and over and over. The code is about halfway down.

<?php
$url = $_GET['url'];

$DB_USER = 'root';
$DB_PASSWORD = 'XXXXXXXXX';
$DB_HOST = 'localhost';
$DB_NAME = 'database';

$dbc = mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
mysql_select_db($DB_NAME) or $error = mysql_error();

$new = new scraper;
// Start Path can be empty, which will be extracted from the start URL
$new->setStartPath();
//$new->setStartPath('http://www.domain.com');
//$new->startURL('http://www.domain.com/file.shtml');
$new->startScraping();

class scraper {
    // URL that stores first URL to start
    var $startURL;
    // List of allowed page extensions
    var $allowedExtensions = array('.css','.xml','.rss','.ico','.js','.gif','.jpg','.jpeg','.png','.bmp','.wmv','.avi','.mp3','.flash','.swf','.css');
    // Which URL to scrape
    var $useURL;
    // Start path, for links that are relative
    var $startPath;

    // Set start path
    function setStartPath($path = NULL){
        if($path != NULL) {
            $this->startPath = $path;
        } else {
            $temp = explode('/', $this->startURL);
            $this->startPath = $temp[0].'//'.$temp[2];
        }
    }

    // Add the start URL
    function startURL($theURL){
        // Set start URL
        $this->startURL = $theURL;
    }

    // Function to get URL contents
    function getContents($url) {
        $ch = curl_init();        // initialize curl handle
        $buffer = curl_exec($ch); // run the whole process
        curl_close($ch);
        return $buffer;
    }

    // Actually do the URLs
    function startScraping() {
        // Get page content
        // Original
        $pageContent = $this->getContents($this->startURL);
        // This does it...forever.
        // $pageContent = $this->getContents($_GET['url']);

        preg_match_all('/href="([^"]+)"/Umis', $pageContent, $results);

        // Add to the email list array
        $insertCount = 0;
        foreach($results[1] as $curEmail) {
            if($insert){ $insertCount++; }
            echo <<< END
<A HREF="mailto:$curEmail">E-Mail Me</a><BR>
END;
        }
        // echo 'Emails found: '.number_format($insertCount).PHP_EOL;

        // Mark the page done
        $insert = mysql_query("INSERT INTO `finishedurls` (`urlname`) VALUES ('".$this->startURL."')");

        // Get list of new page URLs if emails were found on previous page
        preg_match_all('/href="([^"]+)"/Umis', $pageContent, $results);
        $currentList = $this->cleanListURLs($results[1]);

        $insertURLCount = 0;
        // Add the list to the array
        foreach($currentList as $curURL) {
            $insert = mysql_query("INSERT INTO `workingurls` (`urlname`) VALUES ('$curURL')");
            if($insert){ $insertURLCount++; }
        }

        $getURL = mysql_fetch_assoc(mysql_query("SELECT `urlname` FROM `workingurls` ORDER BY RAND() LIMIT 1"));
        $remove = mysql_query("DELETE FROM `workingurls` WHERE `urlname`='$getURL[urlname]' LIMIT 1");

        // Get the new page ready
        $this->startURL = $getURL['urlname'];
        $this->setStartPath();

        // If no more pages, return
        if($this->startURL == NULL){ return; }

        // Clean vars
        unset($results, $pageContent);

        // If more pages, loop again
        $this->startScraping();
    }

    // Function to clean input URLs
    function cleanListURLs($linkList) {
        foreach($linkList as $sub => $url) {
            // Check if only 1 character - there must exist at least / character
            if(strlen($url) <= 1){ unset($linkList[$sub]); }
            // Check for any javascript
            if(eregi('javascript', $url)){ unset($linkList[$sub]); }
            // Check for invalid extensions
            //str_replace($this->allowedExtensions,'',$url,$count);
            if($count > 0){ unset($linkList[$sub]); }
            // If URL starts with #, ignore
            if(substr($url,0,1) == '#'){ unset($linkList[$sub]); }
            // If everything is OK and path is relative, add starting path
            if(substr($url,0,1) == '/' || substr($url,0,1) == '?' || substr($url,0,1) == '='){
                $linkList[$sub] = $this->startPath.$url;
            }
        }
        $remove = mysql_query("DELETE FROM `finishedurls`");
        $optimize = mysql_query("OPTIMIZE TABLE `emaillist` , `finishedurls` , `workingurls`");
        $optimize = mysql_query("DELETE TABLE `emaillist` , `finishedurls` , `workingurls`");
        return $linkList;
    }
}
?>
I think the problem is in this line of code: $this->startURL = $getURL['urnlname']; The query just above it is returning the same URL that was already scraped. I can't tell exactly, because your code came through as one single line when I copied it into Notepad from here (very difficult to read). Thanks
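Untested, but one way to stop it handing back a page it has already done might be to skip anything already recorded in `finishedurls` when picking the next URL, something like:

// Sketch only: pick a random working URL that has not already been finished
$getURL = mysql_fetch_assoc(mysql_query(
    "SELECT `urlname` FROM `workingurls`
     WHERE `urlname` NOT IN (SELECT `urlname` FROM `finishedurls`)
     ORDER BY RAND() LIMIT 1"
));

For that to help, you'd also have to stop emptying `finishedurls` inside cleanListURLs(), since the DELETE FROM `finishedurls` there wipes the history on every pass.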
Actually, I can't see where getContents() actually uses $url. It lacks something like: curl_setopt($ch, CURLOPT_URL, $url); right after this line: $ch = curl_init(); // initialize curl handle Hope that helps!
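While you're in there, you'll probably also want CURLOPT_RETURNTRANSFER, otherwise curl_exec() just prints the page and returns TRUE instead of handing you a string. Something like this (untested):

// Function to get URL contents
function getContents($url) {
    $ch = curl_init();                           // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);         // tell cURL which URL to fetch
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
    $buffer = curl_exec($ch);                    // run the whole process
    curl_close($ch);
    return $buffer;
}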
Turns out $new->setStartPath('http://www.domain.com/'); just needed $url in it instead! $new->setStartPath($url);
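For anyone landing here later, the top of the script presumably ends up something like this (just a sketch based on the fix above; the startURL() call is my assumption, carried over from the commented-out line in the original post):

$url = $_GET['url'];       // URL passed in as ?url=http://www.domain.com
$new = new scraper;
$new->startURL($url);      // assumption: seed the first page with the same URL
$new->setStartPath($url);  // the actual fix: pass $url instead of a hard-coded domain
$new->startScraping();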