Unique Crawler Script

Discussion in 'PHP' started by jasonxxx102, Jan 8, 2013.

  1. #1
    I have a basic PHP web crawler script and I need to expand its functionality. The problem is I'm a total noob at PHP and my knowledge is very basic, so I'm coming here for some help.

    My goal is to have a basic user input (a text box): when the user types in a phrase, let's say "Red Apples", and hits the enter button, the script should start crawling the web for the phrase "Red Apples" and store the plain-text results, along with the URL they originated from, in a database.

    Here is what I've got so far:

    error_reporting( E_ERROR );

    define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

    // Visit counts per domain, and the list of URLs already crawled.
    $domains = array();
    $urls = array();

    function crawl( $url )
    {
      global $domains, $urls;

      echo "Crawling $url... ";

      $parse = parse_url( $url );
      if ( empty( $parse['host'] ) )
      {
        echo "Bad URL.\n";
        return;
      }

      // Count this visit and remember the URL so it is never fetched twice.
      if ( !isset( $domains[ $parse['host'] ] ) )
      {
        $domains[ $parse['host'] ] = 0;
      }
      $domains[ $parse['host'] ]++;
      $urls[] = $url;

      $content = file_get_contents( $url );
      if ( $content === FALSE )
      {
        echo "Error.\n";
        return;
      }

      // Skip everything before the <body> tag, then pull out absolute links.
      $content = stristr( $content, "body" );
      preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

      echo 'Found ' . count( $matches[0] ) . " urls.\n";

      foreach ( $matches[0] as $crawled_url )
      {
        $parse = parse_url( $crawled_url );

        // Bug fix: the original called count() on an integer counter, which
        // always returns 1; compare the per-domain counter itself instead.
        if ( isset( $parse['host'] )
            && ( !isset( $domains[ $parse['host'] ] )
                || $domains[ $parse['host'] ] < CRAWL_LIMIT_PER_DOMAIN )
            && !in_array( $crawled_url, $urls ) )
        {
          sleep( 1 );
          crawl( $crawled_url );
        }
      }
    }
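    For the link-extraction step, an HTML parser is usually more robust than the regex above; here is a minimal sketch using PHP's built-in DOMDocument (the function name `extract_links` is just for illustration):

    ```php
    <?php
    // Sketch: pull absolute http/https links out of an HTML page with
    // DOMDocument rather than a regex over the raw source.
    function extract_links( $html )
    {
      $links = array();
      $doc = new DOMDocument();
      // Suppress warnings from the malformed HTML found on real pages.
      @$doc->loadHTML( $html );
      foreach ( $doc->getElementsByTagName( 'a' ) as $a )
      {
        $href = $a->getAttribute( 'href' );
        // Keep only absolute links; relative ones would need resolving
        // against the page URL first.
        if ( preg_match( '#^https?://#i', $href ) )
        {
          $links[] = $href;
        }
      }
      return array_values( array_unique( $links ) );
    }
    ```

    The same deduplicated list could then feed the `foreach` loop in the crawler in place of `$matches[0]`.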
    If anybody could point me in the right direction that would be awesome.
     
    jasonxxx102, Jan 8, 2013 IP
  2. #2
    Rukbat (Well-Known Member)
    The right direction is any direction that's "away from your script".

    A web crawler crawls the web 24/7/365, saving everything it finds in a database. The site using that database (your search site) allows the user to search the database. Crawling the web after the user submits the request isn't practical - it can take months for the crawler to find a usable number of sites that match a particular request, and your user isn't going to submit his request and then wait a few months for the results.

    So the right direction is to write a crawler that crawls the web, not a particular subset of the web that represents a particular user request.
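    That split could be sketched like this, assuming a `pages` table (the schema and function names are hypothetical, and an in-memory SQLite database stands in for whatever database the crawler would really use):

    ```php
    <?php
    // Sketch of the decoupled design: the crawler writes rows continuously,
    // and the search script only ever reads from the already-built table.
    $db = new PDO( 'sqlite::memory:' ); // swap for your real DSN in production
    $db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
    $db->exec( 'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)' );

    // Crawler side: store each fetched page's plain text under its URL.
    function store_page( PDO $db, $url, $text )
    {
      $stmt = $db->prepare( 'INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)' );
      $stmt->execute( array( $url, $text ) );
    }

    // Search side: when the user submits a phrase, query the index;
    // no crawling happens at request time.
    function search_pages( PDO $db, $phrase )
    {
      $stmt = $db->prepare( 'SELECT url FROM pages WHERE body LIKE ?' );
      $stmt->execute( array( '%' . $phrase . '%' ) );
      return $stmt->fetchAll( PDO::FETCH_COLUMN );
    }
    ```

    A plain `LIKE` scan is only a stand-in here; a real search site would want a proper full-text index.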
     
    Rukbat, Jan 9, 2013 IP
  3. #3
    ryan_uk (Illustrious Member)
    ryan_uk, Jan 11, 2013 IP