PHP getContents, from URL. Getting it over and over and...

Discussion in 'PHP' started by Nintendo, Feb 25, 2010.

  1. #1
    The original code is

    $pageContent = $this->getContents($this->startURL);
    where the URL is set inside the script. I tried changing it to

    $pageContent = $this->getContents($_GET['url']);
    so that the query string picks the URL the script fetches, like this:

    domain.com/implementation.php?url=http://www.domain.com

    It gets the page, but it gets it over and over and over and....
     
    Nintendo, Feb 25, 2010 IP
  2. danx10

    #2
    I'm assuming you're using a custom class/function? PHP already has a function for this; simply do:

    $pageContent = file_get_contents($_GET['url']);
    But you should really validate $_GET['url'] before retrieving its content.

    I don't see why it would get the content over and over; it should only fetch it once each time you visit domain.com/implementation.php?url=http://www.domain.com
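A minimal sketch of what that validation could look like, assuming only http and https addresses should be accepted. The helper name `validateUrl` is hypothetical and not part of the original script; `filter_var()` and `parse_url()` are standard PHP functions.

```php
<?php
// Hypothetical helper (not from the original script): returns the URL if it
// is a syntactically valid http(s) address, or null otherwise.
function validateUrl($raw)
{
    if (!is_string($raw)) {
        return null;
    }
    // FILTER_VALIDATE_URL checks the general URL syntax only.
    $url = filter_var(trim($raw), FILTER_VALIDATE_URL);
    if ($url === false) {
        return null;
    }
    // Assumption: restrict to http/https so file:// and similar are rejected.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }
    return $url;
}
```

In the script this might be used as `$url = validateUrl(isset($_GET['url']) ? $_GET['url'] : null);`, fetching the page only when the result is not null. Note that simply assigning `$_GET['url']` to a variable does not validate anything.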
     
    Last edited: Feb 25, 2010
    danx10, Feb 25, 2010 IP
  3. Nintendo

    #3
    Is

    $url = $_GET['url'];

    what validates it? I've had that at the top of the script.

    $pageContent = file_get_contents($_GET['url']);

    also tries to get it over and over and over.

    The relevant line is about halfway down the code below.

    <?php   
    
    $url = $_GET['url'];
    
     $DB_USER =  'root';
     $DB_PASSWORD = 'XXXXXXXXX';
     $DB_HOST = 'localhost';
     $DB_NAME = 'database';
     $dbc = mysql_connect ($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
     mysql_select_db($DB_NAME) or $error = mysql_error();
    
    $new = new scraper;
    // Start Path can be empty, which will be extracted from the start URL
    $new->setStartPath();
    //$new->setStartPath('http://www.domain.com');
    //$new->startURL('http://www.domain.com/file.shtml');
    
    
    $new->startScraping();
    
    
    class scraper
    {
        // URL that stores first URL to start
        var $startURL;
        
        // List of allowed page extensions
        var $allowedExtensions = array('.css','.xml','.rss','.ico','.js','.gif','.jpg','.jpeg','.png','.bmp','.wmv'
            ,'.avi','.mp3','.flash','.swf','.css');
        
        // Which URL to scrape
        var $useURL;
        
        // Start path, for links that are relative
        var $startPath;
        
        // Set start path
        function setStartPath($path = NULL){
            if($path != NULL)
            {
                $this->startPath = $path;
            } else {
                $temp = explode('/',$this->startURL);
                $this->startPath = $temp[0].'//'.$temp[2];
            }
        }
        
        // Add the start URL
        function startURL($theURL){
            // Set start URL
            $this->startURL = $theURL;
        }
        
        // Function to get URL contents
        function getContents($url)
        {
            $ch = curl_init(); // initialize curl handle
            $buffer = curl_exec($ch); // run the whole process
            curl_close($ch); 
            return $buffer;
        }
        
        // Actually do the URLS
        function startScraping()
        {
            // Get page content
      
    //Original
    $pageContent = $this->getContents($this->startURL);
    
    //This does it...forever.      
    //     $pageContent = $this->getContents($_GET['url']);  
    
    
    
            preg_match_all('/href="([^"]+)"/Umis',$pageContent,$results);
    
            // Add to the email list array
            $insertCount=0;
            foreach($results[1] as $curEmail)
            {
    
                if($insert){$insertCount++;}
            echo <<< END
    &lt;A HREF="mailto:$curEmail"&gt;E-Mail Me&lt;/a&gt;<BR>
    END;
            }
            
      //      echo 'Emails found: '.number_format($insertCount).PHP_EOL;
            
            // Mark the page done
            $insert = mysql_query("INSERT INTO `finishedurls` (`urlname`) VALUES ('".$this->startURL."')");
            
            // Get list of new page URLS is emails were found on previous page
            preg_match_all('/href="([^"]+)"/Umis',$pageContent,$results);
            $currentList = $this->cleanListURLs($results[1]);
            
            $insertURLCount=0;
            // Add the list to the array
            foreach($currentList as $curURL)
            {
                $insert = mysql_query("INSERT INTO `workingurls` (`urlname`) VALUES ('$curURL')");
                if($insert){$insertURLCount++;}
            }
    
            $getURL = mysql_fetch_assoc(mysql_query("SELECT `urlname` FROM `workingurls` ORDER BY RAND() LIMIT 1"));
            $remove = mysql_query("DELETE FROM `workingurls` WHERE `urlname`='$getURL[urlname]' LIMIT 1");
            
            // Get the new page ready
            $this->startURL = $getURL['urlname'];
            $this->setStartPath();
            
            // If no more pages, return
            if($this->startURL == NULL){ return;}
            // Clean vars
            unset($results,$pageContent);
            // If more pages, loop again
            $this->startScraping();
        }
        
        // Function to clean input URLS
        function cleanListURLs($linkList)
        {    
            foreach($linkList as $sub => $url)
            {
                // Check if only 1 character - there must exist at least / character
                if(strlen($url) <= 1){unset($linkList[$sub]);}
                // Check for any javascript
                if(eregi('javascript',$url)){unset($linkList[$sub]);}
                // Check for invalid extensions
                //str_replace($this->allowedExtensions,'',$url,$count);
                if($count > 0){ unset($linkList[$sub]);}
                // If URL starts with #, ignore
                if(substr($url,0,1) == '#'){unset($linkList[$sub]);}
                
                // If everything is OK and path is relative, add starting path
                if(substr($url,0,1) == '/' || substr($url,0,1) == '?' || substr($url,0,1) == '='){
                    $linkList[$sub] = $this->startPath.$url;
                }
            }
            
            $remove = mysql_query("DELETE FROM `finishedurls`");
            $optimize = mysql_query("OPTIMIZE TABLE  `emaillist` , `finishedurls` , `workingurls`");
            $optimize = mysql_query("DELETE TABLE  `emaillist` , `finishedurls` , `workingurls`");
    
            return $linkList;
        }
        
    }
    ?>
     
    Nintendo, Feb 26, 2010 IP
  4. JEET

    #4
    I think the problem is in this line of code:

    $this->startURL = $getURL['urlname'];
    The query just above this line is returning the same URL that was scraped earlier.
    I can't tell for sure because your code came out as one single line when I copied it from here into Notepad (very difficult to read).
    Thanks :)
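If repeated URLs are the cause, one way to guard against them is to remember which pages were already fetched and skip them. The posted script inserts into `finishedurls` but never reads from it, and cleanListURLs() empties that table on every pass. The sketch below uses in-memory arrays instead of the `workingurls`/`finishedurls` tables to keep it short; the function name `nextUnvisitedUrl` is hypothetical.

```php
<?php
// Hypothetical helper: takes URLs off the front of the queue until it finds
// one that has not been visited, marks it visited, and returns it.
// Returns null when the queue is exhausted.
function nextUnvisitedUrl(array &$queue, array &$visited)
{
    while (!empty($queue)) {
        $candidate = array_shift($queue);
        if (!isset($visited[$candidate])) {
            $visited[$candidate] = true;
            return $candidate;
        }
    }
    return null;
}

// Example run. A plain loop is used instead of startScraping() calling
// itself, so the call stack does not grow with every page.
$queue   = array('http://www.domain.com/', 'http://www.domain.com/a', 'http://www.domain.com/');
$visited = array();
$order   = array();
while (($url = nextUnvisitedUrl($queue, $visited)) !== null) {
    $order[] = $url; // a real crawler would fetch and parse $url here
}
```

With the database version, the same idea would mean checking `finishedurls` before inserting a link into `workingurls`.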
     
    JEET, Feb 26, 2010 IP
  5. HostingProvider

    #5
    Actually, I can't see where getContents() uses $url. It lacks something like:

    curl_setopt($ch, CURLOPT_URL, $url);
    right after this line:
    $ch = curl_init(); // initialize curl handle
    Hope that helps!
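Put together, a sketch of getContents() with the URL passed to cURL might look like the following, written as a standalone function rather than a class method. It also sets `CURLOPT_RETURNTRANSFER`, because without it `curl_exec()` prints the response and returns `true` instead of the page body. The 30-second timeout is an assumption, not something from the original script.

```php
<?php
// Sketch of getContents() with the URL actually handed to cURL.
// Returns the response body as a string, or false on failure.
function getContents($url)
{
    $ch = curl_init(); // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);            // which URL to fetch
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed timeout, in seconds
    $buffer = curl_exec($ch); // run the whole process
    curl_close($ch);
    return $buffer;
}
```

Callers should check for `false` before passing the result to preg_match_all().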
     
    HostingProvider, Feb 26, 2010 IP
  6. Nintendo

    #6
    Turns out

    $new->setStartPath(hzzp://wwww.domain.com/);

    just needed to have the $url in it instead!

    $new->setStartPath($url);
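For anyone reading later, a likely reason the empty call misbehaved: setStartPath() with no argument splits `$this->startURL`, which is never set in the posted script because the startURL() call is commented out. In that case `$temp[0]` is an empty string and `$temp[2]` is undefined, so the start path becomes just `//`. A sketch of deriving the start path with `parse_url()` instead of `explode()` follows; the helper name `startPathFromUrl` is hypothetical.

```php
<?php
// Hypothetical helper: reduce a full URL to "scheme://host[:port]" so that
// relative links such as "/page.html" can be appended to it.
// Returns null when the URL has no scheme or host.
function startPathFromUrl($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $path = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $path .= ':' . $parts['port'];
    }
    return $path;
}
```

With this, `$new->setStartPath(startPathFromUrl($url));` would pass only the scheme and host, which is what cleanListURLs() prepends to relative links.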
     
    Nintendo, Feb 26, 2010 IP