PHP crawler script

Discussion in 'PHP' started by WeedGrinch, Feb 15, 2008.

  1. #1
    I am looking for a very simple PHP script that crawls a site and grabs all of the links. I can't find any good scripts, and I've tried at least 10.

    If you could help me with this, or even guide me in the right direction (which functions to use, etc.), I would appreciate it. Thanks.
     
    WeedGrinch, Feb 15, 2008 IP
  2. zerxer

    #2
    You want simple? Okay, here's a really simple one that I just whipped up for ya. :)

    
    <?php
      $original_file = file_get_contents("http://www.domain.com");
      $stripped_file = strip_tags($original_file, "<a>");
      preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
    
      //DEBUGGING
    
      //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
      //$matches[1] now contains only the HREFs in the A tags; ex: link
    
      header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
      print_r($matches); //View the array to see if it worked
    ?>
    You would remove everything after //DEBUGGING when actually using it though. It's just for you to see how it works if you put it in a PHP file by itself for testing.

    Only took 3 lines.. not counting debugging.
     
    zerxer, Feb 15, 2008 IP
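    To make the two capture arrays concrete, here is a small offline run of the same two steps on a hard-coded string instead of a live page. The sample markup and the name $sample_html are made up for illustration; treat it as a sketch, not as part of the original script.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$sample_html = '<p>See <a href="/about.html">About <b>us</b></a> and '
             . '<a class="ext" href="http://www.example.com/">Example</a>.</p>';

// Same two steps as in the post: keep only <a> tags, then match them.
// Stripping <b> first matters, because the pattern expects plain link text.
$stripped = strip_tags($sample_html, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped, $matches);

// $matches[0] holds the complete <a> tags, $matches[1] only the href values.
print_r($matches[0]);
print_r($matches[1]);
```

    Note that the pattern only picks up href values written in double quotes, so single-quoted or unquoted attributes are skipped.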
  3. hasan_889

    #3
    It's just showing 1 page ... it's not crawling deep!
     
    hasan_889, Feb 15, 2008 IP
  4. zerxer

    #4
    Well, I only made it to rip the links off of one page. You can put those 3 lines in a function that returns the $matches[1] array, then loop through that array and call the function again for each value (each one is itself a link) so that it keeps crawling.

    If you'd like to see an example of what those 3 lines do, go to php.sitexero.net/?preview=link_crawler
     
    zerxer, Feb 15, 2008 IP
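    The function-plus-loop idea from post #4 could look roughly like the sketch below. It is not zerxer's code: the names get_links() and crawl(), the depth limit, the visited list and the injectable $fetch callback are all assumptions added for illustration. Relative links are not resolved here, so only absolute http(s) URLs are followed.

```php
<?php
// The three lines from post #2 wrapped in a function that returns $matches[1].
// get_links() is a hypothetical name; it takes HTML rather than a URL so it
// can be tried without network access.
function get_links(string $html): array
{
    $stripped = strip_tags($html, "<a>");
    preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped, $matches);
    return $matches[1];
}

// Hypothetical crawl(): visits pages level by level, at most $max_depth
// levels below $start_url, and returns every URL it has seen.
// $fetch is any callable that returns a page's HTML, or false on failure.
function crawl(string $start_url, int $max_depth, callable $fetch): array
{
    $visited = [$start_url => true]; // keyed by URL so no page is queued twice
    $current = [$start_url];

    for ($depth = 0; $depth < $max_depth && $current; $depth++) {
        $next = [];
        foreach ($current as $url) {
            $html = $fetch($url);
            if ($html === false) {
                continue; // skip pages that could not be fetched
            }
            foreach (get_links($html) as $link) {
                // Assumption: ignore relative links instead of resolving them.
                if (preg_match('#^https?://#i', $link) && !isset($visited[$link])) {
                    $visited[$link] = true;
                    $next[] = $link;
                }
            }
        }
        $current = $next;
    }
    return array_keys($visited);
}

// Usage against a live site (not run here):
// $pages = crawl("http://www.domain.com", 2, 'file_get_contents');
```

    Keeping the depth small matters: every extra level multiplies the number of pages fetched, and each fetch blocks until the remote server answers.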
  5. redhits

    #5
    Learn some PHP and create your own one :)
     
    redhits, Feb 15, 2008 IP
  6. zerxer

    #6
    Yeah.. I did the hard part. Looping through them should be a breeze. Anyways, I'm still pretty bored right now and don't want to start on any of my major projects, so I guess I can quickly edit the example to dig X amount of times.

    EDIT: Never mind.. I was over halfway done when I decided that doing that was very unstable. It would take forever to load. I think my example on how to dig one page is enough. I made a few modifications to it, though. Check php.sitexero.net/?code=link_crawler
     
    zerxer, Feb 15, 2008 IP
  7. alimkb

    #7
    I think "PHP Crawler" would be useful. It's a simple PHP-based web crawler:
    sourceforge.net/projects/php-crawler/
     
    alimkb, Feb 16, 2008 IP
  8. Estevan

    #8
    Hello, here is a simple example:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.urlyourstart.com");
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // time out after 30 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);

    // Search the results from the starting site
    if ($result)
    {
        // I only look for links to http://www. domains; change this for your usage.
        // The dot after "www" is escaped so it matches a literal dot only.
        preg_match_all('/<a href="(http:\/\/www\.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);

        foreach ($output as $item)
        {
            // All links are displayed here
            print_r($item);

            // Now add the link to your database and loop so the engine never stops
        }
    }

    Maybe this helps you.
     
    Estevan, Feb 16, 2008 IP
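    Because the snippet in post #8 passes PREG_SET_ORDER, each $item it prints is a small array: index 0 is the whole match and index 1 is the captured URL. The sketch below shows that shape on a hard-coded string. The sample markup and the $urls array are made up for illustration, and the dot after "www" is escaped so it matches a literal dot.

```php
<?php
// Hypothetical sample markup; only the second link fits the http://www. pattern.
$result = '<a href="/local.html">Local</a> <a href="http://www.example.com/page">Ext</a>';

preg_match_all('/<a href="(http:\/\/www\.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);

$urls = [];
foreach ($output as $item) {
    // With PREG_SET_ORDER, $item[0] is the whole match and $item[1] the captured URL.
    $urls[] = $item[1];
}
print_r($urls);
```

    This pattern requires href to be the first attribute of the tag, so a link such as <a class="x" href="..."> is not matched.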
  9. WeedGrinch

    #9
    I found something. Thanks to everyone who replied!
     
    WeedGrinch, Feb 16, 2008 IP
  10. indianseo

    #10
    Could you please share what you found and how well it works? Thank you.
     
    indianseo, Feb 19, 2008 IP
  11. CATTechnologies

    #11
    Use the cURL functions to fetch the data, then manipulate it as you need and extract the links.

    For more information: cattechnologies.com
     
    CATTechnologies, Feb 21, 2008 IP
  12. kaviarasankk

    #12
    Wow, this works really well.
    zerxer.. great!
     
    kaviarasankk, Jul 14, 2010 IP
  13. Deacalion

    #13
    ugh... seriously old thread. Necrophilia dude.
     
    Deacalion, Jul 14, 2010 IP
  14. kaviarasankk

    #14
    Huh, yeah, you're right,
    but it's...
     
    kaviarasankk, Jul 14, 2010 IP
  15. themullet

    #15
    a nice bit of regex
     
    themullet, Jul 14, 2010 IP
  16. priyanka-mepco

    #16
    We are developing a search engine. For that, we need code for a web crawler, or something related, that collects websites automatically so we don't have to enter them into the database manually.
     
    priyanka-mepco, Feb 28, 2011 IP
  17. priyanka-mepco

    #17
    please help us as soon as possible, if you know......
     
    priyanka-mepco, Feb 28, 2011 IP
  18. aioarticles

    #18
    Can you use simple_html_dom.php?
     
    aioarticles, Aug 25, 2011 IP
  19. sojic

    #19
    Avoid using simple_html_dom for crawling. It takes a lot of memory and the script crashes. A custom crawler using regex is the best.
     
    sojic, Aug 25, 2011 IP
  20. Balajink

    #20
    Hi all,

    I need a simple crawl script in PHP that fetches the category, image, description, keywords, title, meta, price and MRP from an e-commerce website and stores them in a MySQL database. Please reply.
     
    Balajink, Dec 17, 2011 IP