simple php spider

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#1

Trying to write a simple spider that will follow all links and print the link out on the screen.

its not working correctly. It is only printing the first seed link then quitting.

Whats wrong here?
<?php

$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
spider_man($seed);

function spider_man($url)
{
	echo "Following ".$url." \n";
	$data = file_get_contents($seed);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) 
	{
		foreach ($links[0] as $link) 
		{
			spider_man($link);
		}
	}
}

?>
Code (markup):

advancedfuture, Mar 25, 2007 IP

DomainMagnate Illustrious Member

Messages:: 10,932

Likes Received:: 1,022

Best Answers:: 0

Trophy Points:: 455

#2

err i'm not much of a programmer, but there's a chance you'll spider all the internet with that recursive function

DomainMagnate, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#3

thats what my aim is....

well... at least to spider until I decided to kill the page.

advancedfuture, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#4

ahh I found out what was wrong....
I was continually calling $seed. When the line should have been:
$data = file_get_contents($url);
Code (markup):
HOWEVER

After 30 seconds of it spitting out results I get.

Fatal error: Maximum execution time of 30 seconds exceeded in C:\apache\htdocs\spider\random3.php on line 10
Click to expand...

Anyway around this?

advancedfuture, Mar 25, 2007 IP

aredhelrim Peon

Messages:: 7

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#5

Try this :
<?php
// aredhelrim
$seed = $http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/";
$html = file_get_contents($seed);
echo "Page : " . $seed;
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
echo " links : " . $val[0] . "\r\n";

}
?>

aredhelrim, Mar 25, 2007 IP

NickD Well-Known Member

Messages:: 262

Likes Received:: 9

Best Answers:: 0

Trophy Points:: 130

#6

set_time_limit(0);

NickD, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#7

that worked....

the only other issue now is on alot of websites... like dmoz/wiki...

it gets stuck in a never ending loop trying to spider

http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Im not very good with preg_match. How would I go about making it so it ignored dtd and css files?

advancedfuture, Mar 25, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#8


<?
set_time_limit( 0 );
class spider_man
{
	var $limit;
	var $cache;
	var $crawled;
	var $banned_ext;
	
	function spider_man( $url, $banned_ext, $limit )
	{
		$this->start = $url ;
		$this->banned_ext = $banned_ext ;
		$this->limit = $limit ;
		
		if( !fopen( $url, "r") ) return false;
		else $this->_spider( $url );
	}
	function _spider( $url )
	{
		$this->cache = @file_get_contents( urldecode( $url ) );
		if( !$this->cache ) return false;
		$this->crawled[] = urldecode( $url ) ;
		preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
		if ( $links ) :
			foreach ( $links[1] as $hyperlink )
			{
				$this->limit--;
				if( ! $this->limit  ) return;
				
				if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
					$this->crawled[] = $hyperlink;
					echo "Crawling $hyperlink<br />\n";
					unset( $this->cache );
					$this->_spider( $hyperlink );
				endif;
			}
		endif;
	}
	function is_valid_ext( $url )
	{	
		foreach( $this->banned_ext as $ext )
		{
			if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
		}
		return true;
	}
	function is_crawled( $url )
	{
		return in_array( $url, $this->crawled );
	}
}

$banned_ext = array
(
	".dtd",
	".css",
	".xml",
	".js",
	".gif",
	".jpg",
	".jpeg",
	".bmp",
	".ico",
	".rss"
);
$spider = new spider_man( 'http://www.msn.com/', $banned_ext, 100 );
print_r( $spider->crawled );
?>

PHP:

that works, you'll prolly need to edit it some more, but that's how I would achieve such a thing .....

krakjoe, Mar 25, 2007 IP

advancedfuture likes this.

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#9

wow holy crap, mad props:

LoL

And you whipped it together in no time, I wasn't expecting this! Thank you, it actually works flawlessly. Now its time for me to dissect it some more And then incorporate my source.. I have a really awesome project I'm working on right now

advancedfuture, Mar 25, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#10

advancedfuture said: ↑

I wasn't expecting this!
Click to expand...

if you post in these php forums and the post isn't boring as hell, more often than not my first post will be the solution, it's how I keep sharp.

advancedfuture said: ↑

Thank you, it actually works flawlessly.
Click to expand...

See sig : I'm not playin

advancedfuture said: ↑

Now its time for me to dissect it some more And then incorporate my source.. I have a really awesome project I'm working on right now
Click to expand...

Thought you might have to, I can't really see what use it would be as is, out of curiosity whats the project ?

krakjoe, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#11

PHP kills your script after its consumed a max of 8mb of memory. You know if there is a way around that restriction? PHP.ini says max is 8mb... I set it to 8000 but it does not matter.

advancedfuture, Mar 25, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#12

ini_set("memory_limit", 128 );

it's gotta be a amount of memory that's actually available else it'll fall back to system default which is 8mb, do echo ini_get("memory_limit"); exit; after you set it to test if it's possible for you to raise limits at all.

It's going to be memory intensive, you could try writing cache files and using ini's instead of self::$cache and self::$crawled, if you're a bit more specific about what it'll do in the end I can be more specific about how to go about it.

krakjoe, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#13

I'm going to have the spider crawl links, grab a small portion of content from each page. Then store the content and URL in MySQL. All in all this is going to be a very stripped down search engine. Nothing fancy or elaborate. I am trying to shoot for getting lets say 1000-10000 results without my computer taking a dump. But then again, there may NOT be an efficient way of doing this with PHP.

If I wanted to get crazy I really would have to do this in C++. I know basic C++ but I think it would be beyond me at this point to do something like this in C++.

advancedfuture, Mar 25, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#14

if you come at me on msn, I got some time ..... it's possible, using mysql is even better you got mysql_free_result on your side, theres no reason for it to take up any substantial amount of memory, + I got something I'd like to try ......

krakjoe, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#15

sure I'll hop on. I have to remember my MSN info first. Can't say I am going to follow much right now. Would you believe I've been up since friday morning without sleep? lol. Been trying to get another web project done by tonight.

advancedfuture, Mar 25, 2007 IP

advancedfuture Banned

Messages:: 481

Likes Received:: 17

Best Answers:: 0

Trophy Points:: 0

#16

Well I have been playing around with the spider working on getting it to do what I need.

So far I'm getting some query problems.

The program opened up the DB, wrote the first URLs worth of garbled data. Closed. Then will not write the data for the next links. :confused:

Anyways hit me up with some idea's about using sql to cache all the links on the page for spidering. As it is write now it seems to only spider the first few links on the page then move on to others.


<?
set_time_limit( 0 );
ini_set("memory_limit", 128 );

//strip HTML and Javascript
function html2txt($document){
$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
               '@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
               '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
               '@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
);
$text = preg_replace($search, '', $document);
return $text;
}

//open db and write garbled data
function write_data($data)
{
	require("connectDB.php");
	$query = "INSERT INTO repository (data) ". "VALUES ('$data')";
	mysql_query($query) or die('Error, query failed');
	require("disconnectDB.php");
}

//garble the data that we get from the website.
function garble($hyperlink)
{
		$html = implode(" ", file($hyperlink));
		$html = strip_tags(html2txt($html));
		$s = explode(" ",$html);
		shuffle($s);
		$new_html = implode(" ", $s);
		//echo "<br />" . $new_html . "<br />";
		write_data($new_html);
}


class spider_man
{
    var $limit;
    var $cache;
    var $crawled;
    var $banned_ext;
   
    function spider_man( $url, $banned_ext, $limit )
    {
        $this->start = $url ;
        $this->banned_ext = $banned_ext ;
        $this->limit = $limit ;
       
        if( !fopen( $url, "r") ) return false;
        else $this->_spider( $url );
    }

	function _spider( $url )
    {
        $this->cache = @file_get_contents( urldecode( $url ) );
        if( !$this->cache ) return false;
        $this->crawled[] = urldecode( $url ) ;
        preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
        if ( $links ) :
            foreach ( $links[1] as $hyperlink )
            {
                $this->limit--;
                if( ! $this->limit  ) return;
               
                if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
                    $this->crawled[] = $hyperlink;
                    echo "Crawling $hyperlink<br />\n";
					garble($hyperlink);
					unset( $this->cache );
                    $this->_spider( $hyperlink );
                endif;
            }
        endif;
    }
    function is_valid_ext( $url )
    {   
        foreach( $this->banned_ext as $ext )
        {
            if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
        }
        return true;
    }
    function is_crawled( $url )
    {
        return in_array( $url, $this->crawled );
    }
}

$banned_ext = array
(
    ".dtd",
    ".css",
    ".xml",
    ".js",
    ".gif",
    ".jpg",
    ".jpeg",
    ".bmp",
    ".ico",
    ".rss",
	".pdf",
	".png",
	".psd",
	".aspx",
	".jsp",
	".srf",
	".cgi",
	".exe",
	".cfm"
);
$spider = new spider_man( 'http://www.audubon.org/bird/', $banned_ext, 10000 );
print_r( $spider->crawled );


?>

Code (markup):

advancedfuture, Mar 26, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#17

/*
 The reason for that is quite simple, take a look here ...
*/
foreach ( $links[1] as $hyperlink ) // $links[1] is all of the links in the page that the spider last crawled
 {
 $this->limit--;
 if( ! $this->limit ) return;
 
 if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) : 
 $this->crawled[] = $hyperlink;
// BUT if the link is valid before it gets to $links[1][1] it will execute this function again
 echo "Crawling $hyperlink \n";
					garble($hyperlink);
					unset( $this->cache );
 $this->_spider( $hyperlink );
 endif;
 }
PHP:
That's what I meant by a cache, you wanna crawl all these pages one by one, and all of each page, what you need to do is read all of each page and cache every link before crawling the first link it finds, I'd say it's a little more complicated than just one class and some functions, not only that but massively server intensive, maybe it would work out for you to make seperate classes to crawl the links and cache them and then retrieve each link from the cache and get the data you need from it.

krakjoe, Mar 26, 2007 IP

flortl Peon

Messages:: 4

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#18

I want a php script that search for mp3 links or for music lyrics on different websites and after that the results to be added into my mysql database.

Can any do that ?

Thanks.

flortl, Nov 7, 2007 IP

Make a perfect site Well-Known Member

Messages:: 376

Likes Received:: 9

Best Answers:: 0

Trophy Points:: 155

#19

aredhelrim said: ↑

Try this :
<?php
// aredhelrim
$seed = $http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/";
$html = file_get_contents($seed);
echo "Page : " . $seed;
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
echo " links : " . $val[0] . "\r\n";

}
?>
Click to expand...

That doesn't work!
Try this instead:
<?php
// aredhelrim
$seed = "http://www.google.com";
$html = file_get_contents($seed);
echo "Page : " . $seed;
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
echo " links : " . $val[0] . "\r\n";


}
?> 
PHP:

Make a perfect site, Jun 15, 2011 IP

BRUm Well-Known Member

Messages:: 3,086

Likes Received:: 61

Best Answers:: 1

Trophy Points:: 100

#20

Don't use PHP for this, it's simply too slow.

Even Java is much faster than this. Use Java or C#.

BRUm, Jun 15, 2011 IP

Log in or Sign up

simple php spider

advancedfuture Banned

DomainMagnate Illustrious Member

advancedfuture Banned

advancedfuture Banned

aredhelrim Peon

NickD Well-Known Member

advancedfuture Banned

krakjoe Well-Known Member

advancedfuture Banned

krakjoe Well-Known Member

advancedfuture Banned

krakjoe Well-Known Member

advancedfuture Banned

krakjoe Well-Known Member

advancedfuture Banned

advancedfuture Banned

krakjoe Well-Known Member

flortl Peon

Make a perfect site Well-Known Member

BRUm Well-Known Member

Useful Searches