My Spider Script threatened to take down the server

Discussion in 'PHP' started by JamesFarrell, Jan 23, 2008.

  1. #1
    Hi guys,

    I've been working on a script to spider my site and return a list of all URLs.

    I was getting some results last night when, the next thing I knew, the site I was spidering (my own) was taken down by Hostgator and a "Site unavailable, contact billing" message was put up instead.

    They said that the server became unstable due to a massive surge in traffic.

    After 12 hours of downtime, they tell me they'll put my site back up if I agree not to run that script again.

    I need the script (for sitemaps and indexing) and I've spent ages on it now, so I'm emotionally attached...

    Anyway, here's the script. Could someone please advise me on what I'm doing wrong (if anything) and suggest a solution?
    
    <?php

    // Give this function an array of URLs to crawl and it returns every link found on them.
    function crawl($urls)
    {
        global $inurls;
        $inurls = array();
        $socket = array();
        $socketh = curl_multi_init();

        foreach ($urls as $i => $url)
        {
            $socket[$i] = curl_init();
            curl_setopt($socket[$i], CURLOPT_URL, $url);
            // By default cURL prints the response straight to the browser as the
            // script executes; CURLOPT_RETURNTRANSFER makes it return the body instead.
            curl_setopt($socket[$i], CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($socket[$i], CURLOPT_FOLLOWLOCATION, 1);
            curl_multi_add_handle($socketh, $socket[$i]);
        }

        $working = null;
        do {
            curl_multi_exec($socketh, $working);
            curl_multi_select($socketh); // wait for activity rather than spinning the CPU
        } while ($working > 0);

        foreach ($urls as $i => $url)
        {
            echo "<p><b>in second foreach for URL - $url</b></p>";

            $data = curl_multi_getcontent($socket[$i]);
            curl_multi_remove_handle($socketh, $socket[$i]);
            curl_close($socket[$i]);

            // Collect every href="..." value in $data into $matches[1]
            preg_match_all('/href="([a-zA-Z0-9.\-\/?=]+?)"/', $data, $matches);

            $formattedlink = array(); // reset per page so links don't leak between pages
            foreach ($matches[1] as $j => $match)
            {
                $formattedlink[$j] = format_link($match, $url);
                echo "<p>Formatted link is " . $formattedlink[$j] . "</p>";
            }

            if (count($formattedlink) > 0)
                $inurls = array_merge($inurls, $formattedlink);

            unset($matches);
        }

        curl_multi_close($socketh);

        return array_unique($inurls);
    }

    function format_link($inlink, $inpage)
    {
        // If the link contains a session id, strip it. strpos() returns FALSE
        // when there is no match, so test with !== false, not <> "".
        foreach (array('?sid=', '-sid=', '&amp;sid=') as $marker)
        {
            $pos = strpos($inlink, $marker);
            if ($pos !== false)
            {
                $inlink = substr($inlink, 0, $pos);
                echo "edited inlink is $inlink";
            }
        }

        // Is the page a bare domain - http://www.something.com ? Add a trailing slash.
        if (substr($inpage, 0, 7) == 'http://' && strripos($inpage, '/') == 6)
            $inpage = $inpage . '/';

        // Does the link start with "./" ? Strip it.
        if (substr($inlink, 0, 2) == './')
            $inlink = substr($inlink, 2);

        if (substr($inlink, 0, 4) == 'http')
        {
            // Link is already a full http:// URL
            $out_link = $inlink;
        }
        else if (substr($inlink, 0, 1) == '/')
        {
            // Root-relative link: prepend the scheme and host
            $domain = substr($inpage, 0, strpos($inpage, '/', 7));
            $out_link = $domain . $inlink;
        }
        else
        {
            // Either a plain relative link or one with "../" segments
            $updirectory = 0;
            $worklink = $inlink;

            // Count the ../ segments and remove them from the link
            while (substr($worklink, 0, 3) == '../')
            {
                $worklink = substr($worklink, 3);
                $updirectory++;
            }

            $out_link = process_link($worklink, $updirectory, $inpage);
        }

        return $out_link;
    }

    function process_link($inlink, $updirectory, $inpage)
    {
        if (substr($inlink, 0, 1) != '/')
            $inlink = '/' . $inlink;

        $work_page = $inpage;

        // Make sure the page URL ends in a slash
        if (strripos($work_page, '/') + 1 != strlen($work_page))
            $work_page = $work_page . '/';

        $i = 0;
        do {
            // Find the last slash and cut the page URL back one directory
            $slash_pos = strripos($work_page, '/');
            $work_page = substr($inpage, 0, $slash_pos);

            if ($i == $updirectory)
                $outpage = $work_page . $inlink;

            $i++;
        } while (strripos($work_page, '/') !== false);

        return $outpage;
    }

    $unique_array[1] = 'http://www.websitename.com';

    for ($level = 1; $level <= 4; $level++)
    {
        $unique_array = crawl($unique_array);

        echo "<p><b>***** PASS $level unique links ****</b></p>";
        foreach ($unique_array as $unique)
        {
            echo "<p>$unique</p>";
        }
        echo "<p><b>total is " . count($unique_array) . " for pass $level</b></p>";
    }

    ?>
    
    
    PHP:
     
    JamesFarrell, Jan 23, 2008 IP
  2. nico_swd

    #2
    Use fewer URLs, and give it a little break between the requests.

    And I wouldn't necessarily use the curl_multi_* functions, because they run (as the name says) multiple requests at once, which takes more memory and puts a heavier load on the server.

    If your script takes too long to execute, reduce the number of URLs you feed it. I don't know Hostgator's specific limits, but I'd rather play it safe.
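
    Something along these lines is what I mean — just a rough sketch (the 2-second pause is a guess, not Hostgator's actual limit, and $urls is your list of pages):

    <?php
    // One request at a time, with a pause between requests
    $pages = array();
    foreach ($urls as $url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
        sleep(2); // give the server a break before the next request
    }
    ?>
    PHP: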
     
    nico_swd, Jan 23, 2008 IP
  3. JamesFarrell

    #3
    Thanks for your response Nico.

    Do you mean use something like:
    sleep(5);
    PHP:
    How much of a sleep would you recommend?

    How many URLs do you think would be safe to use?

    Thing is, I'm a little in over my head on this and have spent ages trying to get this script working... do you, or anybody else, have an example of extracting many links using curl_exec?

    Even pseudo-code would be useful, as I've only ever used it to look for one occurrence, like this:
    $ch = curl_init();                            // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);          // set the URL to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);     // fail on HTTP errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body into a variable
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);         // times out after 3s
    $result = curl_exec($ch);                     // run the whole process
    curl_close($ch);

    // Do the pregging
    $count = preg_match_all("| of about (.*) from |Ui", $result, $out, PREG_PATTERN_ORDER);
    
    PHP:
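
    Extending that snippet, is this roughly the idea? (Untested sketch — crawl_one_by_one(), the 2-second sleep and the href regex are just my guesses at what you mean.)

    <?php
    // One curl_exec per URL with a pause in between, instead of curl_multi
    function crawl_one_by_one($urls)
    {
        $links = array();
        foreach ($urls as $url)
        {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 3);
            $result = curl_exec($ch);
            curl_close($ch);

            // Pull out every href="..." value on the page
            preg_match_all('/href="([a-zA-Z0-9.\-\/?=]+?)"/', $result, $out);
            $links = array_merge($links, $out[1]);

            sleep(2); // pause between requests
        }
        return array_unique($links);
    }
    ?>
    PHP: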
     
    JamesFarrell, Jan 23, 2008 IP
  4. wisdomtool

    #4
    Also, schedule it at times when there is low activity on the shared server you are on. You may want to coordinate with your hosting provider, Hostgator: explain your situation to them, and maybe they will let you run your spider during the off-peak hours for that server?
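
    For example, a cron entry along these lines (the paths are placeholders for your own) would run the spider at 3:30 am server time, when the box is likely quiet:

    30 3 * * * /usr/bin/php /home/youruser/spider.php > /dev/null 2>&1
    Code: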
     
    wisdomtool, Jan 23, 2008 IP