My Spider Script threatened to take down the server

Discussion in 'PHP' started by JamesFarrell, Jan 23, 2008.

  1. #1
    Hi guys,

    I've been working on a script to spider my site and return a list of all URLs.

    I was getting some results last night when, the next thing I knew, the site I was spidering (my own) was taken down by Hostgator and a "Site unavailable, contact billing" message was put up instead.

    They said that the server became unstable due to a massive surge in traffic.

    After 12 hours of downtime, they tell me they'll put my site back up if I agree not to run that script again.

    I need the script (for sitemaps and indexing) and I've spent ages on it now, so I'm emotionally attached...

    Anyway, here's the script. Could someone please advise me on what I'm doing wrong (if anything) and suggest a solution?
    
    <?php

    // Give this function an array of URLs to crawl and it returns every link found on them.
    function crawl($urls)
    {
        global $inurls;
        $inurls = array();
        $socket = array();
        $socketh = curl_multi_init();

        foreach ($urls as $i => $url)
        {
            $socket[$i] = curl_init();
            curl_setopt($socket[$i], CURLOPT_URL, $url);
            // By default cURL prints the response straight to the browser as the
            // script executes; CURLOPT_RETURNTRANSFER makes it return the body instead.
            curl_setopt($socket[$i], CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($socket[$i], CURLOPT_FOLLOWLOCATION, 1);
            curl_multi_add_handle($socketh, $socket[$i]);
        }

        $working = null;
        do {
            curl_multi_exec($socketh, $working);
            curl_multi_select($socketh); // wait for activity rather than spinning the CPU
        } while ($working > 0);

        foreach ($urls as $i => $url)
        {
            echo "<p><b>in second foreach for URL - $url</b></p>";

            $data = curl_multi_getcontent($socket[$i]);
            curl_multi_remove_handle($socketh, $socket[$i]);
            curl_close($socket[$i]);

            // Collect every href="..." value in $data into $matches[1]
            preg_match_all('/href="([a-zA-Z0-9.\-\/?=]+?)"/', $data, $matches);

            $formattedlink = array(); // reset per page so links don't leak between pages
            foreach ($matches[1] as $j => $match)
            {
                $formattedlink[$j] = format_link($match, $url);
                echo "<p>Formatted link is " . $formattedlink[$j] . "</p>";
            }

            if (count($formattedlink) > 0)
                $inurls = array_merge($inurls, $formattedlink);

            unset($matches);
        }

        curl_multi_close($socketh);

        return array_unique($inurls);
    }

    function format_link($inlink, $inpage)
    {
        // If the link contains a session id, strip it. strpos() returns FALSE
        // when there is no match, so test with !== false, not <> "".
        foreach (array('?sid=', '-sid=', '&amp;sid=') as $marker)
        {
            $pos = strpos($inlink, $marker);
            if ($pos !== false)
            {
                $inlink = substr($inlink, 0, $pos);
                echo "edited inlink is $inlink";
            }
        }

        // Is the page a bare domain - http://www.something.com ? Add a trailing slash.
        if (substr($inpage, 0, 7) == 'http://' && strripos($inpage, '/') == 6)
            $inpage = $inpage . '/';

        // Does the link start with "./" ? Strip it.
        if (substr($inlink, 0, 2) == './')
            $inlink = substr($inlink, 2);

        if (substr($inlink, 0, 4) == 'http')
        {
            // Link is already a full http:// URL
            $out_link = $inlink;
        }
        else if (substr($inlink, 0, 1) == '/')
        {
            // Root-relative link: prepend the scheme and host
            $domain = substr($inpage, 0, strpos($inpage, '/', 7));
            $out_link = $domain . $inlink;
        }
        else
        {
            // Either a plain relative link or one with "../" segments
            $updirectory = 0;
            $worklink = $inlink;

            // Count the ../ segments and remove them from the link
            while (substr($worklink, 0, 3) == '../')
            {
                $worklink = substr($worklink, 3);
                $updirectory++;
            }

            $out_link = process_link($worklink, $updirectory, $inpage);
        }

        return $out_link;
    }

    function process_link($inlink, $updirectory, $inpage)
    {
        if (substr($inlink, 0, 1) != '/')
            $inlink = '/' . $inlink;

        $work_page = $inpage;

        // Make sure the page URL ends in a slash
        if (strripos($work_page, '/') + 1 != strlen($work_page))
            $work_page = $work_page . '/';

        $i = 0;
        do {
            // Find the last slash and cut the page URL back one directory
            $slash_pos = strripos($work_page, '/');
            $work_page = substr($inpage, 0, $slash_pos);

            if ($i == $updirectory)
                $outpage = $work_page . $inlink;

            $i++;
        } while (strripos($work_page, '/') !== false);

        return $outpage;
    }

    $unique_array[1] = 'http://www.websitename.com';

    for ($level = 1; $level <= 4; $level++)
    {
        $unique_array = crawl($unique_array);

        echo "<p><b>***** PASS $level unique links ****</b></p>";
        foreach ($unique_array as $unique)
        {
            echo "<p>$unique</p>";
        }
        echo "<p><b>total is " . count($unique_array) . " for pass $level</b></p>";
    }

    ?>
    
    
    PHP:
     
    JamesFarrell, Jan 23, 2008 IP
  2. nico_swd

    #2
    Use fewer URLs, and give it a little break between the requests.

    And I wouldn't necessarily use the curl_multi_* functions, because they run (as the name says) multiple requests at once, which takes more memory and puts a heavier load on the server.

    If your script takes too long to execute, reduce the number of URLs you feed it. I don't know Hostgator's specific limits, but I'd rather play it safe.
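
    Something along these lines is what I mean — just a rough sketch (the 2-second pause is a guess, not Hostgator's actual limit, and $urls is your list of pages):

    <?php
    // One request at a time, with a pause between requests
    $pages = array();
    foreach ($urls as $url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
        sleep(2); // give the server a break before the next request
    }
    ?>
    PHP: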
     
    nico_swd, Jan 23, 2008 IP
  3. JamesFarrell

    #3
    Thanks for your response Nico.

    Do you mean use something like:
    sleep(5);
    PHP:
    How much of a sleep would you recommend?

    How many URLs do you think would be safe to use?

    Thing is, I'm a little in over my head on this and have spent ages trying to get this script working... do you, or anybody else, have an example of extracting many links using curl_exec?

    Even pseudo-code would be useful, as I've only ever used it to look for one occurrence, like this:
    $ch = curl_init();                            // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);          // set the URL to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);     // fail on HTTP errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body into a variable
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);         // times out after 3s
    $result = curl_exec($ch);                     // run the whole process
    curl_close($ch);

    // Do the pregging
    $count = preg_match_all("| of about (.*) from |Ui", $result, $out, PREG_PATTERN_ORDER);
    
    PHP:
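
    Extending that snippet, is this roughly the idea? (Untested sketch — crawl_one_by_one(), the 2-second sleep and the href regex are just my guesses at what you mean.)

    <?php
    // One curl_exec per URL with a pause in between, instead of curl_multi
    function crawl_one_by_one($urls)
    {
        $links = array();
        foreach ($urls as $url)
        {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 3);
            $result = curl_exec($ch);
            curl_close($ch);

            // Pull out every href="..." value on the page
            preg_match_all('/href="([a-zA-Z0-9.\-\/?=]+?)"/', $result, $out);
            $links = array_merge($links, $out[1]);

            sleep(2); // pause between requests
        }
        return array_unique($links);
    }
    ?>
    PHP: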
     
    JamesFarrell, Jan 23, 2008 IP
  4. wisdomtool

    #4
    Also, schedule it at times when there is low activity on the shared server you are on. You may want to coordinate with your hosting provider, Hostgator: explain your situation to them, and maybe they will let you run your spider during the off-peak hours for that server?
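
    For example, a cron entry along these lines (the paths are placeholders for your own) would run the spider at 3:30 am server time, when the box is likely quiet:

    30 3 * * * /usr/bin/php /home/youruser/spider.php > /dev/null 2>&1
    Code: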
     
    wisdomtool, Jan 23, 2008 IP