Trying to write a simple spider that will follow all links and print each link to the screen. It's not working correctly: it only prints the first seed link, then quits. What's wrong here?

```php
<?php
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
spider_man($seed);

function spider_man($url) {
    echo "Following " . $url . " <br>\n";
    $data = file_get_contents($seed);
    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) {
        foreach ($links[0] as $link) {
            spider_man($link);
        }
    }
}
?>
```
Err, I'm not much of a programmer, but there's a chance you'll spider the entire internet with that recursive function.
Ahh, I found out what was wrong: I was continually calling `$seed` inside the function, when the line should have been:

```php
$data = file_get_contents($url);
```

HOWEVER, after 30 seconds of it spitting out results I get an error. Any way around this?
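Dying after exactly 30 seconds points at PHP's execution-time limit, whose default is 30 seconds. A minimal sketch of lifting it for a long-running crawl (only sensible for a one-off or CLI script, not a normal web page):

```php
<?php
// The default max_execution_time is 30 seconds, which matches the
// "dies after 30 seconds" symptom. 0 means no limit.
set_time_limit(0);

// Equivalent ini-based form:
ini_set('max_execution_time', '0');

// Confirm the value now in effect.
echo ini_get('max_execution_time'), "\n";
```

`set_time_limit()` also resets the timer, so calling it periodically inside a loop works too if your host forbids a limit of 0.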
Try this:

```php
<?php
// aredhelrim
$seed = "http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/";
$html = file_get_contents($seed);
echo "Page : " . $seed;
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    echo "<br><font color=red>links :</font> " . $val[0] . "\r\n";
}
?>
```
That worked. The only other issue now is that on a lot of websites, like dmoz/wiki, it gets stuck in a never-ending loop trying to spider http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. I'm not very good with preg_match. How would I go about making it ignore .dtd and .css files?
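One way to sketch that filter without touching the regex at all is to check the URL path's extension after extraction. The banned list here is just an assumption to extend as needed:

```php
<?php
// Sketch: decide whether a link is worth spidering by looking at the
// extension of its path component. Returns true for pages we want.
function is_wanted_link($url) {
    $banned = array('dtd', 'css', 'js', 'xml', 'gif', 'jpg', 'png', 'ico');
    $path   = parse_url($url, PHP_URL_PATH);            // ignore query string
    $ext    = strtolower(pathinfo((string)$path, PATHINFO_EXTENSION));
    return !in_array($ext, $banned);
}

var_dump(is_wanted_link('http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd')); // bool(false)
var_dump(is_wanted_link('http://www.dmoz.org/Arts/Animation/'));               // bool(true)
```

Filtering after the match keeps the URL-matching regex simple instead of trying to express "not ending in .dtd" inside the pattern.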
```php
<?php
set_time_limit(0);

class spider_man {
    var $start;
    var $limit;
    var $cache;
    var $crawled;
    var $banned_ext;

    function spider_man($url, $banned_ext, $limit) {
        $this->start      = $url;
        $this->banned_ext = $banned_ext;
        $this->limit      = $limit;
        if (!fopen($url, "r")) return false;
        else $this->_spider($url);
    }

    function _spider($url) {
        $this->cache = @file_get_contents(urldecode($url));
        if (!$this->cache) return false;
        $this->crawled[] = urldecode($url);
        preg_match_all("#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links);
        if ($links) :
            foreach ($links[1] as $hyperlink) {
                $this->limit--;
                if (!$this->limit) return;
                if ($this->is_valid_ext(trim($hyperlink)) and !$this->is_crawled($hyperlink)) :
                    $this->crawled[] = $hyperlink;
                    echo "Crawling $hyperlink<br />\n";
                    unset($this->cache);
                    $this->_spider($hyperlink);
                endif;
            }
        endif;
    }

    function is_valid_ext($url) {
        foreach ($this->banned_ext as $ext) {
            if ($ext == substr($url, strlen($url) - strlen($ext))) return false;
        }
        return true;
    }

    function is_crawled($url) {
        return in_array($url, $this->crawled);
    }
}

$banned_ext = array(".dtd", ".css", ".xml", ".js", ".gif", ".jpg", ".jpeg", ".bmp", ".ico", ".rss");
$spider = new spider_man('http://www.msn.com/', $banned_ext, 100);
print_r($spider->crawled);
?>
```

That works; you'll probably need to edit it some more, but that's how I would achieve such a thing.
Wow, holy crap, mad props! LoL. And you whipped it together in no time; I wasn't expecting this! Thank you, it actually works flawlessly. Now it's time for me to dissect it some more and then incorporate my source. I have a really awesome project I'm working on right now.
If you post in these PHP forums and the post isn't boring as hell, more often than not my first post will be the solution; it's how I keep sharp. See sig: I'm not playin'. Thought you might have to edit it, since I can't really see what use it would be as-is. Out of curiosity, what's the project?
PHP kills your script after it's consumed a max of 8 MB of memory. Do you know if there is a way around that restriction? php.ini says the max is 8 MB... I set it to 8000 but it doesn't matter.
```php
ini_set("memory_limit", "128M");
```

It's gotta be an amount of memory that's actually available, or else it'll fall back to the system default, which is 8 MB. Do

```php
echo ini_get("memory_limit"); exit;
```

after you set it, to test whether it's possible for you to raise limits at all. It's going to be memory intensive; you could try writing cache files and using INIs instead of `$this->cache` and `$this->crawled`. If you're a bit more specific about what it'll do in the end, I can be more specific about how to go about it.
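One pitfall worth spelling out: `memory_limit` uses shorthand suffixes (K, M, G), and a bare integer is interpreted as bytes, so setting it to `8000` asks for 8000 bytes, not 8000 KB. A hypothetical helper for turning a `memory_limit` string into bytes, handy when checking how much headroom the crawler actually has:

```php
<?php
// memory_limit values use shorthand notation (K/M/G); a bare integer
// means bytes. Hypothetical helper: convert such a string to bytes.
function memory_limit_bytes($limit) {
    if ($limit === '-1') return -1;                  // -1 means unlimited
    $unit = strtoupper(substr($limit, -1));
    $n    = (int)$limit;                             // numeric prefix
    switch ($unit) {
        case 'G': return $n * 1024 * 1024 * 1024;
        case 'M': return $n * 1024 * 1024;
        case 'K': return $n * 1024;
        default:  return $n;                         // plain bytes
    }
}

echo memory_limit_bytes('128M'), "\n";   // 134217728
echo memory_limit_bytes('8000'), "\n";   // 8000 -- bytes, not kilobytes!
```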
I'm going to have the spider crawl links and grab a small portion of content from each page, then store the content and URL in MySQL. All in all, this is going to be a very stripped-down search engine. Nothing fancy or elaborate. I'm shooting for, let's say, 1000-10000 results without my computer taking a dump. But then again, there may not be an efficient way of doing this with PHP. If I wanted to get crazy I really would have to do this in C++. I know basic C++, but I think it would be beyond me at this point to do something like this in C++.
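The "small portion of content from each page" step can be sketched as a pure function before any MySQL is involved. The 200-character snippet length is an arbitrary assumption:

```php
<?php
// Sketch: reduce a fetched HTML page to a short plain-text excerpt
// suitable for storing next to the URL in a MySQL row.
function page_excerpt($html, $length = 200) {
    $text = strip_tags($html);                         // drop markup
    $text = preg_replace('/\s+/', ' ', trim($text));   // collapse whitespace
    return substr($text, 0, $length);                  // keep a snippet
}

echo page_excerpt('<h1>Birds</h1><p>Audubon   guide to birds.</p>'), "\n";
```

Keeping extraction separate from storage also makes the memory question easier: each excerpt is small and fixed-size, so the only thing that grows with crawl size is the database, not the script's memory.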
If you come at me on MSN, I got some time... It's possible; using MySQL is even better, since you've got mysql_free_result on your side. There's no reason for it to take up any substantial amount of memory. Plus, I got something I'd like to try...
Sure, I'll hop on. I have to remember my MSN info first. Can't say I'm going to follow much right now. Would you believe I've been up since Friday morning without sleep? lol. Been trying to get another web project done by tonight.
Well, I have been playing around with the spider, working on getting it to do what I need. So far I'm getting some query problems. The program opened the DB, wrote the first URL's worth of garbled data, closed, then would not write the data for the next links. Anyway, hit me up with some ideas about using SQL to cache all the links on the page for spidering. As it is right now, it seems to only spider the first few links on the page, then move on to others.

```php
<?php
set_time_limit(0);
ini_set("memory_limit", "128M");

// strip HTML and JavaScript
function html2txt($document) {
    $search = array(
        '@<script[^>]*?>.*?</script>@si',  // strip out JavaScript
        '@<style[^>]*?>.*?</style>@siU',   // strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',           // strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@'        // strip multi-line comments including CDATA
    );
    return preg_replace($search, '', $document);
}

// open db and write garbled data
function write_data($data) {
    require("connectDB.php");
    // escape the text first; an unescaped quote in the page data
    // breaks the query, which would explain the one-row-then-fail symptom
    $data  = mysql_real_escape_string($data);
    $query = "INSERT INTO repository (data) VALUES ('$data')";
    mysql_query($query) or die('Error, query failed');
    require("disconnectDB.php");
}

// garble the data that we get from the website
function garble($hyperlink) {
    $html = implode(" ", file($hyperlink));
    $html = strip_tags(html2txt($html));
    $s = explode(" ", $html);
    shuffle($s);
    $new_html = implode(" ", $s);
    //echo "<br />" . $new_html . "<br />";
    write_data($new_html);
}

class spider_man {
    var $start;
    var $limit;
    var $cache;
    var $crawled;
    var $banned_ext;

    function spider_man($url, $banned_ext, $limit) {
        $this->start      = $url;
        $this->banned_ext = $banned_ext;
        $this->limit      = $limit;
        if (!fopen($url, "r")) return false;
        else $this->_spider($url);
    }

    function _spider($url) {
        $this->cache = @file_get_contents(urldecode($url));
        if (!$this->cache) return false;
        $this->crawled[] = urldecode($url);
        preg_match_all("#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links);
        if ($links) :
            foreach ($links[1] as $hyperlink) {
                $this->limit--;
                if (!$this->limit) return;
                if ($this->is_valid_ext(trim($hyperlink)) and !$this->is_crawled($hyperlink)) :
                    $this->crawled[] = $hyperlink;
                    echo "Crawling $hyperlink<br />\n";
                    garble($hyperlink);
                    unset($this->cache);
                    $this->_spider($hyperlink);
                endif;
            }
        endif;
    }

    function is_valid_ext($url) {
        foreach ($this->banned_ext as $ext) {
            if ($ext == substr($url, strlen($url) - strlen($ext))) return false;
        }
        return true;
    }

    function is_crawled($url) {
        return in_array($url, $this->crawled);
    }
}

$banned_ext = array(".dtd", ".css", ".xml", ".js", ".gif", ".jpg", ".jpeg", ".bmp", ".ico",
                    ".rss", ".pdf", ".png", ".psd", ".aspx", ".jsp", ".srf", ".cgi", ".exe", ".cfm");
$spider = new spider_man('http://www.audubon.org/bird/', $banned_ext, 10000);
print_r($spider->crawled);
?>
```
The reason for that is quite simple; take a look here:

```php
foreach ($links[1] as $hyperlink) {  // $links[1] is all of the links in the page the spider last crawled
    $this->limit--;
    if (!$this->limit) return;
    if ($this->is_valid_ext(trim($hyperlink)) and !$this->is_crawled($hyperlink)) :
        $this->crawled[] = $hyperlink;  // BUT if this link is valid, it recurses here before it even gets to $links[1][1]
        echo "Crawling $hyperlink<br />\n";
        garble($hyperlink);
        unset($this->cache);
        $this->_spider($hyperlink);
    endif;
}
```

That's what I meant by a cache: you want to crawl all these pages one by one, and all of each page. What you need to do is read all of each page and cache every link before crawling the first link it finds. I'd say it's a little more complicated than just one class and some functions, and it's massively server intensive besides. It might work out for you to make separate classes: one to crawl the links and cache them, and one to retrieve each link from the cache and get the data you need from it.
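The "cache every link first, crawl later" approach described above is essentially a breadth-first queue. A minimal sketch under that assumption, where `$fetch_links` is a stand-in callback for the regex extraction used in the thread:

```php
<?php
// Sketch: breadth-first crawl. Instead of recursing into the first
// valid link on a page, push every link onto a queue, then work
// through the queue oldest-first. $fetch_links($url) must return an
// array of links found on that page.
function crawl_bfs($seed, $limit, $fetch_links) {
    $queue   = array($seed);
    $crawled = array();                       // url => true, doubles as seen-set
    while ($queue && count($crawled) < $limit) {
        $url = array_shift($queue);           // oldest queued link first
        if (isset($crawled[$url])) continue;  // skip already-crawled pages
        $crawled[$url] = true;
        foreach ($fetch_links($url) as $link) {
            if (!isset($crawled[$link])) $queue[] = $link;
        }
    }
    return array_keys($crawled);              // crawl order
}

// Usage with a fake link graph instead of live HTTP fetches:
$graph = array('a' => array('b', 'c'), 'b' => array('c'), 'c' => array());
print_r(crawl_bfs('a', 10, function ($u) use ($graph) {
    return isset($graph[$u]) ? $graph[$u] : array();
}));
```

Swapping recursion for a queue also caps memory more predictably: the recursion depth no longer grows with the crawl, and the queue itself could later live in MySQL instead of an array.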
I want a PHP script that searches for MP3 links or music lyrics on different websites and then adds the results into my MySQL database. Can anyone do that? Thanks.
That doesn't work! Try this instead:

```php
<?php
// aredhelrim
$seed = "http://www.google.com";
$html = file_get_contents($seed);
echo "Page : " . $seed;
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    echo "<br><font color=red>links :</font> " . $val[0] . "\r\n";
}
?>
```