
simple php spider

Discussion in 'PHP' started by advancedfuture, Mar 25, 2007.

  1. #1
    I'm trying to write a simple spider that will follow all links and print each link out on the screen.

    It's not working correctly. It only prints the first seed link and then quits.

    What's wrong here?


    <?php
    
    $seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
    spider_man($seed);
    
    function spider_man($url)
    {
    	echo "Following ".$url." <br>\n";
    	$data = file_get_contents($seed);
    	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) 
    	{
    		foreach ($links[0] as $link) 
    		{
    			spider_man($link);
    		}
    	}
    }
    
    ?>
    Code (markup):
     
    advancedfuture, Mar 25, 2007 IP
  2. DomainMagnate

    DomainMagnate Illustrious Member

    Messages:
    10,932
    Likes Received:
    1,022
    Best Answers:
    0
    Trophy Points:
    455
    #2
    Err, I'm not much of a programmer, but there's a chance you'll spider the entire internet with that recursive function :p
     
    DomainMagnate, Mar 25, 2007 IP
  3. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #3
    That's what my aim is...

    Well... at least to spider until I decide to kill the page.
     
    advancedfuture, Mar 25, 2007 IP
  4. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Ahh, I found out what was wrong...
    I was continually calling $seed, when the line should have been:

    $data = file_get_contents($url);
    Code (markup):
    HOWEVER

    After 30 seconds of it spitting out results, the script gets cut off (it hits PHP's max execution time).

    Any way around this?
     
    advancedfuture, Mar 25, 2007 IP
  5. aredhelrim

    aredhelrim Peon

    Messages:
    7
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #5
    :) Try this:
    <?php
    // aredhelrim
    $seed = "http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/";
    $html = file_get_contents($seed);
    echo "Page : " . $seed;
    preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $val) {
        echo "<br><font color=red>links :</font> " . $val[0] . "\r\n";
    }
    ?>
     
    aredhelrim, Mar 25, 2007 IP
  6. NickD

    NickD Well-Known Member

    Messages:
    262
    Likes Received:
    9
    Best Answers:
    0
    Trophy Points:
    130
    #6
    set_time_limit(0); // lifts PHP's default 30-second execution limit
     
    NickD, Mar 25, 2007 IP
  7. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #7
    That worked...

    The only other issue now is that on a lot of websites, like dmoz/wiki, it gets stuck in a never-ending loop trying to spider

    http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

    I'm not very good with preg_match. How would I go about making it ignore .dtd and .css files?
     
    advancedfuture, Mar 25, 2007 IP
  8. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #8
    
    <?php
    set_time_limit( 0 );
    class spider_man
    {
    	var $start;
    	var $limit;
    	var $cache;
    	var $crawled;
    	var $banned_ext;
    	
    	// constructor: check the seed URL is reachable, then start crawling
    	function spider_man( $url, $banned_ext, $limit )
    	{
    		$this->start = $url ;
    		$this->banned_ext = $banned_ext ;
    		$this->limit = $limit ;
    		
    		if( !fopen( $url, "r") ) return false;
    		else $this->_spider( $url );
    	}
    	function _spider( $url )
    	{
    		$this->cache = @file_get_contents( urldecode( $url ) );
    		if( !$this->cache ) return false;
    		$this->crawled[] = urldecode( $url ) ;
    		preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
    		if ( $links ) :
    			foreach ( $links[1] as $hyperlink )
    			{
    				$this->limit--;
    				if( ! $this->limit  ) return;
    				
    				if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
    					$this->crawled[] = $hyperlink;
    					echo "Crawling $hyperlink<br />\n";
    					unset( $this->cache );
    					$this->_spider( $hyperlink );
    				endif;
    			}
    		endif;
    	}
    	function is_valid_ext( $url )
    	{	
    		foreach( $this->banned_ext as $ext )
    		{
    			if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
    		}
    		return true;
    	}
    	function is_crawled( $url )
    	{
    		return in_array( $url, $this->crawled );
    	}
    }
    
    $banned_ext = array
    (
    	".dtd",
    	".css",
    	".xml",
    	".js",
    	".gif",
    	".jpg",
    	".jpeg",
    	".bmp",
    	".ico",
    	".rss"
    );
    $spider = new spider_man( 'http://www.msn.com/', $banned_ext, 100 );
    print_r( $spider->crawled );
    ?>
    
    PHP:
    that works, you'll prolly need to edit it some more, but that's how I would achieve such a thing .....
     
    krakjoe, Mar 25, 2007 IP
    advancedfuture likes this.
  9. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #9
    Wow, holy crap, mad props!

    LoL

    And you whipped it together in no time, I wasn't expecting this! Thank you, it actually works flawlessly. Now it's time for me to dissect it some more :D and then incorporate it into my source.. I have a really awesome project I'm working on right now :)
     
    advancedfuture, Mar 25, 2007 IP
  10. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #10
    If you post in these PHP forums and the post isn't boring as hell, more often than not my first post will be the solution; it's how I keep sharp.

    See sig: I'm not playin ;)

    Thought you might have to; I can't really see what use it would be as is. Out of curiosity, what's the project?
     
    krakjoe, Mar 25, 2007 IP
  11. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #11
    PHP kills the script after it's consumed a max of 8 MB of memory. Do you know if there is a way around that restriction? php.ini says the max is 8 MB... I set it to 8000 but it doesn't matter.
     
    advancedfuture, Mar 25, 2007 IP
  12. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #12
    ini_set("memory_limit", "128M"); // needs the M suffix - a bare number is treated as bytes

    It's gotta be an amount of memory that's actually available, or else it'll fall back to the system default, which is 8 MB. Do echo ini_get("memory_limit"); exit; after you set it, to test whether you can raise the limit at all.

    It's going to be memory intensive; you could try writing cache files and using INI files instead of $this->cache and $this->crawled. If you're a bit more specific about what it'll do in the end, I can be more specific about how to go about it.
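    For the cache file thing, this rough sketch is all I mean - the filename is made up and you'd want something less naive than re-reading the whole file every time, but it gets the crawled list out of memory:

    <?php
    // sketch only: keep the crawled list in a file instead of in a class member
    // ("crawled.cache" is just an example filename)
    function mark_crawled( $url )
    {
    	file_put_contents( "crawled.cache", $url . "\n", FILE_APPEND );
    }
    function is_crawled( $url )
    {
    	$list = @file( "crawled.cache", FILE_IGNORE_NEW_LINES );
    	return $list && in_array( $url, $list );
    }
    ?>
    PHP: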
     
    krakjoe, Mar 25, 2007 IP
  13. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #13
    I'm going to have the spider crawl links and grab a small portion of content from each page, then store the content and URL in MySQL. All in all this is going to be a very stripped-down search engine, nothing fancy or elaborate. I am trying to shoot for getting, let's say, 1000-10000 results without my computer taking a dump. But then again, there may NOT be an efficient way of doing this with PHP.

    If I wanted to get crazy I really would have to do this in C++. I know basic C++, but I think it would be beyond me at this point to do something like this in C++.
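    For the storage side, something like this rough sketch is what I'm picturing - the connection details, table and column names are all just made up for the example:

    <?php
    // sketch only - credentials, table and column names are hypothetical
    mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
    mysql_select_db('spider') or die(mysql_error());

    function store_page($url, $html)
    {
    	$excerpt = substr(strip_tags($html), 0, 255); // grab a small portion of the page
    	$url     = mysql_real_escape_string($url);
    	$excerpt = mysql_real_escape_string($excerpt);
    	mysql_query("INSERT INTO pages (url, excerpt) VALUES ('$url', '$excerpt')")
    		or die('Error, query failed: ' . mysql_error());
    }
    ?>
    Code (markup):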
     
    advancedfuture, Mar 25, 2007 IP
  14. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #14
    If you come at me on MSN, I got some time... It's possible, and using MySQL is even better: you've got mysql_free_result on your side, so there's no reason for it to take up any substantial amount of memory. Plus, I got something I'd like to try...
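    Just to show what I mean about mysql_free_result - a sketch only, the connection details and table are placeholders:

    <?php
    // sketch: free each result set as soon as you're done with it
    mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
    mysql_select_db('spider') or die(mysql_error());

    $result = mysql_query("SELECT data FROM repository LIMIT 100");
    while ( $row = mysql_fetch_assoc($result) )
    {
    	// ... do whatever with $row['data'] ...
    }
    mysql_free_result($result); // hand the memory back straight away
    ?>
    PHP: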
     
    krakjoe, Mar 25, 2007 IP
  15. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #15
    Sure, I'll hop on. I have to remember my MSN info first. :D Can't say I am going to follow much right now. Would you believe I've been up since Friday morning without sleep? lol. I've been trying to get another web project done by tonight.
     
    advancedfuture, Mar 25, 2007 IP
  16. advancedfuture

    advancedfuture Banned

    Messages:
    481
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Well, I have been playing around with the spider, working on getting it to do what I need.

    So far I'm getting some query problems.

    The program opened up the DB, wrote the first URL's worth of garbled data, and closed. Then it will not write the data for the next links. :confused:

    Anyway, hit me up with some ideas about using SQL to cache all the links on a page for spidering. As it is right now, it seems to only spider the first few links on a page and then move on to others.

    
    <?php
    set_time_limit( 0 );
    ini_set("memory_limit", "128M"); // needs the M suffix - a bare number is treated as bytes
    
    //strip HTML and Javascript
    function html2txt($document)
    {
    	$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
    	                '@<style[^>]*?>.*?</style>@siU',   // Strip style tags properly
    	                '@<[\/\!]*?[^<>]*?>@si',           // Strip out HTML tags
    	                '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
    	);
    	$text = preg_replace($search, '', $document);
    	return $text;
    }
    
    //open db and write garbled data
    function write_data($data)
    {
    	require("connectDB.php");
    	// NOTE: quotes inside $data will break this query - it should be run
    	// through mysql_real_escape_string() before being inserted
    	$query = "INSERT INTO repository (data) VALUES ('$data')";
    	mysql_query($query) or die('Error, query failed: ' . mysql_error());
    	require("disconnectDB.php");
    }
    
    //garble the data that we get from the website.
    function garble($hyperlink)
    {
    		$html = implode(" ", file($hyperlink));
    		$html = strip_tags(html2txt($html));
    		$s = explode(" ",$html);
    		shuffle($s);
    		$new_html = implode(" ", $s);
    		//echo "<br />" . $new_html . "<br />";
    		write_data($new_html);
    }
    
    
    class spider_man
    {
        var $start;
        var $limit;
        var $cache;
        var $crawled;
        var $banned_ext;
       
        function spider_man( $url, $banned_ext, $limit )
        {
            $this->start = $url ;
            $this->banned_ext = $banned_ext ;
            $this->limit = $limit ;
           
            if( !fopen( $url, "r") ) return false;
            else $this->_spider( $url );
        }
    
    	function _spider( $url )
        {
            $this->cache = @file_get_contents( urldecode( $url ) );
            if( !$this->cache ) return false;
            $this->crawled[] = urldecode( $url ) ;
            preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
            if ( $links ) :
                foreach ( $links[1] as $hyperlink )
                {
                    $this->limit--;
                    if( ! $this->limit  ) return;
                   
                    if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
                        $this->crawled[] = $hyperlink;
                        echo "Crawling $hyperlink<br />\n";
    					garble($hyperlink);
    					unset( $this->cache );
                        $this->_spider( $hyperlink );
                    endif;
                }
            endif;
        }
        function is_valid_ext( $url )
        {   
            foreach( $this->banned_ext as $ext )
            {
                if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
            }
            return true;
        }
        function is_crawled( $url )
        {
            return in_array( $url, $this->crawled );
        }
    }
    
    $banned_ext = array
    (
        ".dtd",
        ".css",
        ".xml",
        ".js",
        ".gif",
        ".jpg",
        ".jpeg",
        ".bmp",
        ".ico",
        ".rss",
    	".pdf",
    	".png",
    	".psd",
    	".aspx",
    	".jsp",
    	".srf",
    	".cgi",
    	".exe",
    	".cfm"
    );
    $spider = new spider_man( 'http://www.audubon.org/bird/', $banned_ext, 10000 );
    print_r( $spider->crawled );
    
    
    ?>
    
    Code (markup):
     
    advancedfuture, Mar 26, 2007 IP
  17. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #17
    
    /*
     The reason for that is quite simple, take a look here ...
    */
    foreach ( $links[1] as $hyperlink ) // $links[1] is all of the links in the page that the spider last crawled
                {
                    $this->limit--;
                    if( ! $this->limit  ) return;
                   
                    if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) : 
                        $this->crawled[] = $hyperlink;
    // BUT if this link is valid, _spider() gets called again right here, before the loop ever reaches $links[1][1]
                        echo "Crawling $hyperlink<br />\n";
    					garble($hyperlink);
    					unset( $this->cache );
                        $this->_spider( $hyperlink );
                    endif;
                }
    
    PHP:
    That's what I meant by a cache. You wanna crawl all these pages one by one, and all of each page, so what you need to do is read all of each page and cache every link before crawling the first one it finds. I'd say it's a little more complicated than just one class and some functions, and not only that, it's massively server intensive. Maybe it would work out for you to make separate classes: one to crawl the links and cache them, and another to retrieve each link from the cache and get the data you need from it.
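    Something along these lines is the general shape of it - an untested sketch of a queue/cache instead of straight recursion, with no extension filtering or database stuff in it, and the start URL is just an example:

    <?php
    set_time_limit( 0 );
    // untested sketch: cache every link found on a page into a queue first,
    // then work through the queue one entry at a time instead of recursing immediately
    function crawl_queue( $start, $limit )
    {
    	$queue   = array( $start );
    	$crawled = array();
    	while ( $queue && $limit-- )
    	{
    		$url = array_shift( $queue );
    		if ( in_array( $url, $crawled ) ) continue;
    		$crawled[] = $url;
    		$html = @file_get_contents( $url );
    		if ( !$html ) continue;
    		preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $html, $links );
    		foreach ( $links[1] as $link ) $queue[] = $link; // cache first, follow later
    	}
    	return $crawled;
    }
    print_r( crawl_queue( 'http://www.msn.com/', 50 ) );
    ?>
    PHP:
    That way memory is just the queue and the crawled list, and either of those could live in a cache file or MySQL instead of an array.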
     
    krakjoe, Mar 26, 2007 IP
  18. flortl

    flortl Peon

    Messages:
    4
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #18
    I want a PHP script that searches for mp3 links or for music lyrics on different websites, and then adds the results to my MySQL database.

    Can anyone do that?

    Thanks.
     
    flortl, Nov 7, 2007 IP
  19. Make a perfect site

    Make a perfect site Well-Known Member

    Messages:
    376
    Likes Received:
    9
    Best Answers:
    0
    Trophy Points:
    155
    #19
    That doesn't work!
    Try this instead:

    <?php
    // aredhelrim
    $seed = "http://www.google.com";
    $html = file_get_contents($seed);
    echo "Page : " . $seed;
    preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);
    
    foreach ($matches as $val) {
        echo "<br><font color=red>links :</font> " . $val[0] . "\r\n";
    }
    ?>
    PHP:
     
    Make a perfect site, Jun 15, 2011 IP
  20. BRUm

    BRUm Well-Known Member

    Messages:
    3,086
    Likes Received:
    61
    Best Answers:
    1
    Trophy Points:
    100
    #20
    Don't use PHP for this; it's simply too slow.

    Even Java is much faster. Use Java or C#.
     
    BRUm, Jun 15, 2011 IP