Unique Crawler Script

Discussion in 'PHP' started by jasonxxx102, Jan 8, 2013.

  1. #1
    I have a basic PHP web crawler script and I need to expand its functionality. The problem is I'm a total noob at PHP and my knowledge is very basic, so I'm coming here for some help.

    My goal is to have a basic user input (a text box): when the user types in a phrase, let's say "Red Apples", and hits the enter button, the script should start crawling the web for the phrase "Red Apples" and store the plain-text results, along with the URL they originated from, in a database.

    Here is what I've got so far:

    error_reporting( E_ERROR );

    define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

    // Visit counts per domain, and the list of URLs already crawled.
    $domains = array();
    $urls = array();

    function crawl( $url )
    {
      global $domains, $urls;

      echo "Crawling $url... ";

      $parse = parse_url( $url );
      if ( empty( $parse['host'] ) )
      {
        echo "Bad URL.\n";
        return;
      }

      // Count this visit and remember the URL so it is never fetched twice.
      if ( !isset( $domains[ $parse['host'] ] ) )
      {
        $domains[ $parse['host'] ] = 0;
      }
      $domains[ $parse['host'] ]++;
      $urls[] = $url;

      $content = file_get_contents( $url );
      if ( $content === FALSE )
      {
        echo "Error.\n";
        return;
      }

      // Skip everything before the <body> tag, then pull out absolute links.
      $content = stristr( $content, "body" );
      preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

      echo 'Found ' . count( $matches[0] ) . " urls.\n";

      foreach ( $matches[0] as $crawled_url )
      {
        $parse = parse_url( $crawled_url );

        // Bug fix: the original called count() on an integer counter, which
        // always returns 1; compare the per-domain counter itself instead.
        if ( isset( $parse['host'] )
            && ( !isset( $domains[ $parse['host'] ] )
                || $domains[ $parse['host'] ] < CRAWL_LIMIT_PER_DOMAIN )
            && !in_array( $crawled_url, $urls ) )
        {
          sleep( 1 );
          crawl( $crawled_url );
        }
      }
    }
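    For the link-extraction step, an HTML parser is usually more robust than the regex above; here is a minimal sketch using PHP's built-in DOMDocument (the function name `extract_links` is just for illustration):

    ```php
    <?php
    // Sketch: pull absolute http/https links out of an HTML page with
    // DOMDocument rather than a regex over the raw source.
    function extract_links( $html )
    {
      $links = array();
      $doc = new DOMDocument();
      // Suppress warnings from the malformed HTML found on real pages.
      @$doc->loadHTML( $html );
      foreach ( $doc->getElementsByTagName( 'a' ) as $a )
      {
        $href = $a->getAttribute( 'href' );
        // Keep only absolute links; relative ones would need resolving
        // against the page URL first.
        if ( preg_match( '#^https?://#i', $href ) )
        {
          $links[] = $href;
        }
      }
      return array_values( array_unique( $links ) );
    }
    ```

    The same deduplicated list could then feed the `foreach` loop in the crawler in place of `$matches[0]`.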
    If anybody could point me in the right direction that would be awesome.
     
    jasonxxx102, Jan 8, 2013 IP
  2. #2
    Rukbat (Well-Known Member)
    The right direction is any direction that's "away from your script".

    A web crawler crawls the web 24/7/365, saving everything it finds in a database. The site using that database (your search site) allows the user to search the database. Crawling the web after the user submits the request isn't practical - it can take months for the crawler to find a usable number of sites that match a particular request, and your user isn't going to submit his request and then wait a few months for the results.

    So the right direction is to write a crawler that crawls the web, not a particular subset of the web that represents a particular user request.
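    That split could be sketched like this, assuming a `pages` table (the schema and function names are hypothetical, and an in-memory SQLite database stands in for whatever database the crawler would really use):

    ```php
    <?php
    // Sketch of the decoupled design: the crawler writes rows continuously,
    // and the search script only ever reads from the already-built table.
    $db = new PDO( 'sqlite::memory:' ); // swap for your real DSN in production
    $db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
    $db->exec( 'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)' );

    // Crawler side: store each fetched page's plain text under its URL.
    function store_page( PDO $db, $url, $text )
    {
      $stmt = $db->prepare( 'INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)' );
      $stmt->execute( array( $url, $text ) );
    }

    // Search side: when the user submits a phrase, query the index;
    // no crawling happens at request time.
    function search_pages( PDO $db, $phrase )
    {
      $stmt = $db->prepare( 'SELECT url FROM pages WHERE body LIKE ?' );
      $stmt->execute( array( '%' . $phrase . '%' ) );
      return $stmt->fetchAll( PDO::FETCH_COLUMN );
    }
    ```

    A plain `LIKE` scan is only a stand-in here; a real search site would want a proper full-text index.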
     
    Rukbat, Jan 9, 2013 IP
  3. #3
    ryan_uk (Illustrious Member)
    ryan_uk, Jan 11, 2013 IP