I have a basic PHP web crawler script and I need to expand its functionality. The problem is I'm a total noob at PHP and my knowledge is very basic, so I'm coming here for some help. My goal is to have a basic user input (a text box): when the user types in a phrase, say "Red Apples", and hits Enter, the script should start crawling the web for the phrase "Red Apples" and store the plain-text results, along with the URL they originated from, in a database. Here is what I've got so far:

<?php
error_reporting(E_ERROR);

define('CRAWL_LIMIT_PER_DOMAIN', 50);

$domains = array();
$urls    = array();

function crawl($url)
{
    global $domains, $urls;

    echo "Crawling $url... ";
    $parse = parse_url($url);
    $host  = isset($parse['host']) ? $parse['host'] : '';
    // Count how many URLs we have crawled on this host.
    $domains[$host] = isset($domains[$host]) ? $domains[$host] + 1 : 1;
    $urls[] = $url;

    $content = file_get_contents($url);
    if ($content === false) {
        echo "Error.\n";
        return;
    }

    // Only look at the document from "body" onward.
    $content = stristr($content, 'body');
    preg_match_all('/http:\/\/[^ "\']+/', $content, $matches);
    echo 'Found ' . count($matches[0]) . " urls.\n";

    foreach ($matches[0] as $crawled_url) {
        $parse   = parse_url($crawled_url);
        $host    = isset($parse['host']) ? $parse['host'] : '';
        $crawled = isset($domains[$host]) ? $domains[$host] : 0;
        // Was: count($domains[$parse['host']]) — count() of an integer
        // is always 1, so the per-domain limit never kicked in.
        if ($crawled < CRAWL_LIMIT_PER_DOMAIN && !in_array($crawled_url, $urls)) {
            sleep(1);
            crawl($crawled_url);
        }
    }
}

If anybody could point me in the right direction that would be awesome.
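One incidental note on the script above: extracting links with a regex misses https:// URLs and chokes on messy markup. A more robust sketch uses PHP's built-in DOMDocument to pull href attributes instead (the function name and sample HTML here are just illustrations, not part of the original script):

```php
<?php
// Sketch: extract absolute links from a page with DOMDocument
// rather than a regex. Keeps only absolute URLs for simplicity.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world HTML

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // parse_url() returns null for the scheme of relative URLs,
        // so this keeps only absolute http/https links.
        if ($href !== '' && parse_url($href, PHP_URL_SCHEME) !== null) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

$html = '<body><a href="http://example.com/x">x</a>'
      . '<a href="/relative">r</a>'
      . '<a href="https://example.org/y">y</a></body>';
print_r(extract_links($html));
```

Relative links could be resolved against the page's own URL before queueing, but that is a separate step left out here.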
The right direction is any direction that's "away from your script". A web crawler crawls the web 24/7/365, saving everything it finds in a database. The site that uses that database (your search site) lets the user search the database.

Crawling the web after the user submits a request isn't practical - it can take months for the crawler to find a usable number of pages that match a particular query, and your user isn't going to submit a request and then wait a few months for the results. So the right direction is to write a crawler that crawls the web continuously, not a particular subset of the web that represents one user's request.
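To make that split concrete, here's a minimal sketch of the two halves sharing one database. This is only an illustration: the table name, helper names, and example URLs are my own choices, it uses SQLite via PDO (any database works the same way), and the sample rows stand in for what the crawler would insert from its crawl loop:

```php
<?php
// Sketch of the crawler/search split: the crawler writes pages
// as it finds them; the search site only reads what's stored.

function init_db(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS pages (
        url  TEXT PRIMARY KEY,
        body TEXT NOT NULL
    )');
}

// Called by the crawler process for every page it fetches,
// independent of any particular user query.
function store_page(PDO $db, string $url, string $text): void
{
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)'
    );
    $stmt->execute(array($url, $text));
}

// Called by the search site when a user submits a phrase;
// it never triggers any crawling.
function search_pages(PDO $db, string $phrase): array
{
    $stmt = $db->prepare('SELECT url FROM pages WHERE body LIKE ?');
    $stmt->execute(array('%' . $phrase . '%'));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$db = new PDO('sqlite::memory:');
init_db($db);

// In the real system these calls happen inside the crawl loop,
// long before any user searches. Sample rows for illustration:
store_page($db, 'http://example.com/a', 'Fresh red apples for sale');
store_page($db, 'http://example.com/b', 'Green pears only');

// SQLite's LIKE is case-insensitive for ASCII, so "Red Apples"
// matches the stored "red apples" text.
print_r(search_pages($db, 'Red Apples'));
```

A plain LIKE scan won't scale to a real index; a full-text index (e.g. SQLite FTS or MySQL FULLTEXT) is the usual next step, but the architectural point is the same: the crawler fills the table, the search page only queries it.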