Simple web crawler

Discussion in 'PHP' started by Make a perfect site, Jun 15, 2011.

  1. #1
    Hi all,

    I am just developing a very simple web spider/crawler. Here is the code:

    <?php
    // Seed URL to start crawling from.
    $seed = "http://www.akosblog.com";
    $html = file_get_contents($seed);
    echo "Page: " . $seed;

    // Grab every http:// URL that appears in the page source.
    preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $val) {
        echo "<br><font color=red>link:</font> " . $val[0] . "\r\n";
    }
    ?>
    This code just gets all the links from the selected page. Now I want to move on: I want the spider to follow those links, index each page it finds, and then follow the links on those pages as well.
    So how could I do that?

    Regards,
    Akos
     
    Make a perfect site, Jun 15, 2011 IP
  2. idlecool

    idlecool Greenhorn

    Messages:
    4
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    11
    #2
    Nice post. I don't know much PHP, so this is one more small step into it :)

    Btw, I think regexes are a pain. It's never easy to see where you went wrong, and it's an even bigger PITA when you need to modify old code later. There are some excellent parsing libraries available in different languages; I think PHP has one too, so let me know if you find it ;) I mostly code in Python and use `lxml`, which does its job awesomely :) You should look into it if you happen to have free time.
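
    For what it's worth, a quick look suggests PHP ships with a built-in DOMDocument class that can do the same link extraction without regex. I don't write PHP, so treat this as a rough, untested sketch:

    <?php
    // Rough sketch: parse the page with DOMDocument instead of regex.
    $html = file_get_contents("http://www.akosblog.com");

    $dom = new DOMDocument();
    // Suppress warnings about malformed real-world HTML.
    @$dom->loadHTML($html);

    // Walk every <a> tag and print its href attribute.
    foreach ($dom->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href') . "<br>\n";
    }
    ?>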
     
    idlecool, Jun 16, 2011 IP
  3. G3n3s!s

    G3n3s!s Active Member

    Messages:
    325
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    80
    #3
    Put each link in a database, and in the next phase select them all and repeat the same process for each link.
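
    Something like this as a minimal sketch, using an in-memory queue and the same regex from the first post (swap the arrays for a database table if you want it to survive between runs):

    <?php
    // Minimal sketch: breadth-first crawl with an in-memory queue.
    $queue    = array("http://www.akosblog.com"); // URLs still to fetch
    $visited  = array();                          // URLs already fetched
    $maxPages = 20;                               // safety limit for testing

    while (!empty($queue) && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled this one
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip pages that fail to load
        }
        echo "Page: " . $url . "<br>\n";

        // Same regex as before, reused on every page we fetch.
        preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches);
        foreach ($matches[0] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link; // follow this link in a later iteration
            }
        }
    }
    ?>

    You'll also want to limit it to your own domain or add a depth limit, otherwise it will wander off across the whole web.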
     
    G3n3s!s, Jun 16, 2011 IP