PHP crawler script

Discussion in 'PHP' started by WeedGrinch, Feb 15, 2008.

  1. #1
    I am looking for a very simple PHP script that crawls a site and grabs all of the links. I can't find any good scripts, and I've tried at least 10.

    If you could help me with this, or even guide me in the right direction (what functions, etc) I would appreciate it. Thanks.
     
    WeedGrinch, Feb 15, 2008 IP
  2. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #2
    You want simple? Okay, here's a really simple one that I just whipped up for ya. :)

    
    <?php
      $original_file = file_get_contents("http://www.domain.com");
      $stripped_file = strip_tags($original_file, "<a>");
      preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
    
      //DEBUGGING
    
      //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
      //$matches[1] now contains only the HREFs in the A tags; ex: link
    
      header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
      print_r($matches); //View the array to see if it worked
    ?>
    You would remove everything after //DEBUGGING when actually using it though. It's just for you to see how it works if you put it in a PHP file by itself for testing.

    Only took 3 lines.. not counting debugging.
     
    zerxer, Feb 15, 2008 IP
  3. hasan_889

    hasan_889 Banned

    Messages:
    303
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #3
    It's just showing 1 page ... it's not crawling deep!
     
    hasan_889, Feb 15, 2008 IP
  4. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Well, I only made it to rip the links off of one page. You can put those 3 lines in a function that returns the $matches[1] array, then loop through that array and call the function again on each link so that it keeps crawling, roughly like the sketch below.
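
    A rough sketch of that looping idea (not the original example; the function names, the depth limit, and the visited list are just one way to keep it from running forever):

    <?php
      // Hypothetical sketch of the crawling loop described above.
      // get_links() is just the 3-line extractor wrapped in a function.
      function get_links($url)
      {
        $original_file = @file_get_contents($url);
        if ($original_file === false) {
          return array(); // page could not be fetched
        }
        $stripped_file = strip_tags($original_file, "<a>");
        preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
        return $matches[1]; // only the HREFs
      }

      // crawl() follows each link it finds, down to $depth levels, and
      // remembers visited URLs so it doesn't loop back on itself.
      function crawl($url, $depth, &$visited)
      {
        if ($depth <= 0 || isset($visited[$url])) {
          return;
        }
        $visited[$url] = true;

        foreach (get_links($url) as $link) {
          if (strpos($link, "http") === 0) { // only follow absolute links in this sketch
            crawl($link, $depth - 1, $visited);
          }
        }
      }

      $visited = array();
      crawl("http://www.domain.com", 2, $visited); // a depth of 2 keeps it from taking forever
      header("Content-type: text/plain");
      print_r(array_keys($visited)); // every page the crawler reached
    ?>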

    If you'd like to see an example of what those 3 lines do, go to php.sitexero.net/?preview=link_crawler
     
    zerxer, Feb 15, 2008 IP
  5. redhits

    redhits Notable Member

    Messages:
    3,023
    Likes Received:
    277
    Best Answers:
    0
    Trophy Points:
    255
    #5
    Learn some PHP and create your own. :)
     
    redhits, Feb 15, 2008 IP
  6. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Yeah... I did the hard part. Looping through them should be a breeze. Anyway, I'm still pretty bored right now and don't want to start on any of my major projects, so I guess I can quickly edit the example to dig X levels deep.

    EDIT: Never mind... I was over halfway done when I decided that doing that was very unstable. It would take forever to load. I think my example of how to dig one page is enough. I made a few modifications to it, though. Check php.sitexero.net/?code=link_crawler
     
    zerxer, Feb 15, 2008 IP
  7. alimkb

    alimkb Member

    Messages:
    27
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    36
    #7
    I think "PHP Crawler" would be useful. It's a simple PHP-based web crawler:
    sourceforge.net/projects/php-crawler/
     
    alimkb, Feb 16, 2008 IP
  8. Estevan

    Estevan Peon

    Messages:
    120
    Likes Received:
    8
    Best Answers:
    1
    Trophy Points:
    0
    #8
    Hello,
    here is a simple example:

    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.urlyourstart.com");
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout after 30 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);

    // Search the results from the starting site
    if ($result)
    {
        // I only look at top-level "http://www." domains; change this regex for your usage
        preg_match_all('/<a href="(http:\/\/www\.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);

        foreach ($output as $item)
        {
            // All the links are displayed here
            print_r($item);

            // Now you would add them to your database and loop so the engine never stops
        }
    }
    ?>

    Maybe this helps you.
     
    Estevan, Feb 16, 2008 IP
  9. WeedGrinch

    WeedGrinch Active Member

    Messages:
    1,236
    Likes Received:
    73
    Best Answers:
    0
    Trophy Points:
    90
    #9
    I found something. Thanks to everyone who replied!
     
    WeedGrinch, Feb 16, 2008 IP
  10. indianseo

    indianseo Peon

    Messages:
    208
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Could you please share what you found and how well it works? Thank you.
     
    indianseo, Feb 19, 2008 IP
  11. CATTechnologies

    CATTechnologies Guest

    Messages:
    13
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Use the cURL functions to get the data, then manipulate it as you need to extract the links.
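
    For instance, a minimal sketch of that approach (fetch the page with cURL, then pull the links out with DOMDocument; the URL is only a placeholder):

    <?php
      $ch = curl_init("http://www.example.com"); // placeholder, use the site you want to crawl
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // get the HTML back as a string
      curl_setopt($ch, CURLOPT_TIMEOUT, 30);
      $html = curl_exec($ch);
      curl_close($ch);

      if ($html) {
        libxml_use_internal_errors(true); // real-world HTML is rarely valid, so silence the warnings
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        foreach ($dom->getElementsByTagName('a') as $anchor) {
          echo $anchor->getAttribute('href') . "\n"; // each link's href
        }
      }
    ?>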

    For more information: cattechnologies.com
     
    CATTechnologies, Feb 21, 2008 IP
  12. kaviarasankk

    kaviarasankk Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Wow, it's really working, super!
    zerxer... great!
     
    kaviarasankk, Jul 14, 2010 IP
  13. Deacalion

    Deacalion Peon

    Messages:
    438
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #13
    ugh... seriously old thread. Necrophilia dude.
     
    Deacalion, Jul 14, 2010 IP
  14. kaviarasankk

    kaviarasankk Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #14
    Huh, yeah, you're right.
    But it's...
     
    kaviarasankk, Jul 14, 2010 IP
  15. themullet

    themullet Member

    Messages:
    110
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    26
    #15
    A nice bit of regex.
     
    themullet, Jul 14, 2010 IP
  16. priyanka-mepco

    priyanka-mepco Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #16
    We are developing a search engine. For that we need code for a web crawler, or something similar, that fetches websites automatically without us entering them into the database manually.
     
    priyanka-mepco, Feb 28, 2011 IP
  17. priyanka-mepco

    priyanka-mepco Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #17
    Please help us as soon as possible, if you know how.
     
    priyanka-mepco, Feb 28, 2011 IP
  18. aioarticles

    aioarticles Peon

    Messages:
    6
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #18
    Can you use simple_html_dom.php?
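
    (If you mean the Simple HTML DOM Parser library, its usual link-grabbing pattern looks roughly like this, assuming simple_html_dom.php has been downloaded next to the script and the URL is just a placeholder:)

    <?php
      include 'simple_html_dom.php'; // the library file itself, downloaded separately

      $html = file_get_html('http://www.example.com'); // placeholder URL

      if ($html) {
        foreach ($html->find('a') as $anchor) {
          echo $anchor->href . "\n"; // the href of each <a> tag
        }
        $html->clear(); // free the parser's memory
      }
    ?>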
     
    aioarticles, Aug 25, 2011 IP
  19. sojic

    sojic Active Member

    Messages:
    133
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    90
    #19
    Avoid using simple_html_dom for crawling; it takes a lot of memory and the script crashes. A custom crawler using regex is best.
     
    sojic, Aug 25, 2011 IP
  20. Balajink

    Balajink Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #20
    Hi all,

    I need a simple crawl script in PHP that fetches the category, image, description, keywords, title, meta, price, and MRP of an ecommerce website and stores them in a MySQL database. So please reply to me.
     
    Balajink, Dec 17, 2011 IP