How to make a web crawler in PHP?

Discussion in 'PHP' started by viking.kid07, Nov 13, 2009.

  1. #1
    Hello everyone

    I am working on my project and am stuck at this point. I need to build a web crawler in PHP and link it to my site. It is really urgent for me. Could somebody please suggest a solution to this problem? I would be really grateful. Please help me out, friends. Waiting for a reply!

    Thanks.

    --
    Shekhar
     
    viking.kid07, Nov 13, 2009 IP
  2. Bohra

    Bohra Prominent Member

    Messages:
    12,573
    Likes Received:
    537
    Best Answers:
    0
    Trophy Points:
    310
    #2
    Bohra, Nov 13, 2009 IP
  3. bonecone

    bonecone Peon

    Messages:
    54
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    $output = file_get_contents('http://www.website.com'); assigns the source code of a web page to a variable. You can then search it with string functions or regular expressions.
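To make the "search it" step concrete, here is a minimal sketch. The helper names extract_title and extract_hrefs are made up for illustration, and the sample HTML is a stand-in so the snippet runs without a network connection. The live file_get_contents call is left as a comment and assumes allow_url_fopen is enabled on your server.

```php
<?php
// Hypothetical helper: pull the <title> text out of an HTML string.
function extract_title(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Hypothetical helper: collect every quoted href value from <a> tags.
function extract_hrefs(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $m);
    return array_values(array_unique($m[2]));
}

// Stand-in page so the example runs offline (assumption, not real site content).
$html = '<html><head><title>Example</title></head><body>'
      . '<a href="/about">About</a> '
      . '<a class="x" href=\'http://www.website.com/contact\'>Contact</a>'
      . '</body></html>';
// Live version: $html = file_get_contents('http://www.website.com');

$title    = extract_title($html);
$links    = extract_hrefs($html);
$hasAbout = strpos($html, 'About') !== false; // plain string-function search
```

Regular expressions are fine for quick jobs like this, but they only catch quoted href values and can trip over unusual markup.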
     
    bonecone, Nov 14, 2009 IP
  4. kmussel

    kmussel Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Here is a good tutorial to get you started. It uses cURL to get the page. It then uses preg_match to get all the links on the page and follows each link. You can check it out at http://kevinmusselman.com/blog/2009/11/crawling-web-pages-for-sitemaps/
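The fetch, extract, follow loop described here could look roughly like the sketch below. This is not the code from the linked tutorial. The names fetch_with_curl and crawl are hypothetical, it uses preg_match_all rather than preg_match so that it collects every link in one call, and URL resolution is deliberately simplified to absolute URLs and root-relative paths on the starting host.

```php
<?php
// Fetch a page with cURL; returns null on failure. Requires the curl extension.
function fetch_with_curl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Breadth-first crawl limited to the start URL's host.
// $fetch is any callable that maps a URL to its HTML (or null on failure).
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $scheme  = parse_url($startUrl, PHP_URL_SCHEME);
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === null) {
            continue; // fetch failed, nothing to parse
        }

        // Capture href values, dropping any #fragment.
        preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\'#]+)/i', $html, $m);
        foreach ($m[1] as $href) {
            // Simplified resolution: root-relative paths only.
            if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
                $href = $scheme . '://' . $host . $href;
            }
            // Stay on the same host and skip pages already seen.
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return array_keys($visited);
}

// Example call against a live site (not executed here):
// $urls = crawl('http://www.website.com/', 'fetch_with_curl', 20);
```

Passing the fetcher in as a callable keeps the loop easy to try out with canned pages before you point it at a real site. The $maxPages cap is there so that a mistake cannot crawl forever.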
     
    kmussel, Nov 29, 2009 IP
  5. fireworking

    fireworking Peon

    Messages:
    460
    Likes Received:
    10
    Best Answers:
    0
    Trophy Points:
    0
    #5
    First, though, you should study how crawlers work before building one in PHP.

    If you are trying to crawl other websites, you should first learn cURL, a PHP extension that lets you fetch pages from other sites.

    Reading the links might be tricky, though, because you have to check whether each one is marked nofollow. Just take it one step at a time.
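One way to handle the nofollow check is sketched below. The function name followable_links is made up, and the approach is a regex approximation: it assumes quoted attribute values and no ">" inside them. PHP's DOMDocument class would be a sturdier parser if the DOM extension is available to you.

```php
<?php
// Hypothetical helper: return hrefs a polite crawler may follow.
function followable_links(string $html): array
{
    // A page-level <meta name="robots" content="... nofollow ..."> covers every link.
    if (preg_match('/<meta\s[^>]*name\s*=\s*["\']robots["\'][^>]*>/i', $html, $meta)
        && stripos($meta[0], 'nofollow') !== false) {
        return [];
    }

    $links = [];
    preg_match_all('/<a\s[^>]*>/i', $html, $tags); // each opening <a ...> tag
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\shref\s*=\s*(["\'])(.*?)\1/is', $tag, $href)) {
            continue; // anchor without a quoted href
        }
        // rel is a space-separated token list, compared case-insensitively.
        $rel = preg_match('/\srel\s*=\s*(["\'])(.*?)\1/is', $tag, $r)
            ? strtolower($r[2])
            : '';
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $tokens, true)) {
            $links[] = $href[2];
        }
    }
    return $links;
}

// Sample markup (made up for illustration).
$sample = '<a href="/a">A</a>'
        . '<a rel="nofollow" href="/b">B</a>'
        . '<a rel="external NoFollow" href="/c">C</a>'
        . '<a href="/d" rel="noopener">D</a>';
$allowed = followable_links($sample);
```

Splitting rel into tokens matters because values such as "external nofollow" are common, and a plain equality test against "nofollow" would miss them.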

    If you are trying to crawl your own website, it is best to write a program that generates tags from your content. Or, even quicker, don't reinvent the wheel: use a framework.

    You might want to see this site:
    http://www.bitrepository.com/how-to-create-a-simple-web-data-extractor.html
     
    fireworking, Nov 29, 2009 IP