I'm working on a simple search engine, but I'm having a problem with the scraper I wrote. All I want to do is fetch each page with file_get_contents and insert the contents of its title tag into my database. The problem is that, since it's scraping so many pages, the script is timing out or not scraping properly. Here's my code:

```php
<?php
set_time_limit(0);

$dbhost = "localhost";
$dbuser = "*****";
$dbpass = "*****";
$dbname = "*****";

mysql_connect($dbhost, $dbuser, $dbpass);
mysql_select_db($dbname);

// Return the text in $a between the first $b and the next $c.
function get($a, $b, $c) {
    $y = explode($b, $a);
    $x = explode($c, $y[1]);
    return $x[0];
}

for ($i = 1; $i <= 1000000; $i++) {
    $content = file_get_contents("http://www.website.com/page.php?id=$i");
    $title = get($content, "<title>", "</title>");
    mysql_query("INSERT INTO spider (title) VALUES ('$title')");
}
?>
```

Anyone?
The runtime on that is going to be nuts. In any case, why not use a regular expression? Off the top of my head, something like:

```php
preg_match("/<title>(.*?)<\/title>/", $content, $matches);
$title = $matches[1]; // preg_match() itself only returns 1 or 0
```

I'm sure the regex could be improved, but you get the idea. Also, what do you mean when you say it's "not scraping properly"? If you mean it's not grabbing the title, I'm going to bet it's that get function you wrote. As for the timeouts, your set_time_limit(0) already lifts PHP's own limit, but the web server can still enforce a timeout of its own, and looping that many times with a remote fetch and a function call on every pass is going to take a while.
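As a rough sketch of the regex approach above: preg_match() returns 1 or 0 and fills its third argument with the matches, so the title text comes from the capture group. The helper name extract_title is made up for illustration, and the "s" modifier is added on the assumption that some titles span more than one line.

```php
<?php
// Hypothetical helper (not from the original posts): return the text of the
// first <title> element in $html, or null when there is none.
function extract_title($html) {
    // (.*?) is non-greedy, so it stops at the first closing tag;
    // the "s" modifier lets "." match newlines inside the title.
    if (preg_match('/<title>(.*?)<\/title>/s', $html, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

var_dump(extract_title("<html><head><title>My Page</title></head></html>"));
var_dump(extract_title("<html><body>No title here</body></html>"));
```

Returning null for pages without a title lets the calling loop skip the insert instead of storing an empty row.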
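What I do is have the PHP file handle part of the work and then do a header redirect to itself, so no single request runs long enough to hit a server timeout. If you do this, though, you have to relax Firefox's redirect loop protection in about:config (the network.http.redirection-limit setting), or the browser will stop following the redirects.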
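The self-redirect idea could be sketched roughly as follows. The batch size, the "start" query parameter, and the helper name next_batch_url are all assumptions for illustration, and the per-page work (fetch and insert) is left as a comment.

```php
<?php
// Hypothetical helper: given where this request started, return the URL for
// the next batch, or null once every id up to $lastId has been covered.
function next_batch_url($script, $start, $batchSize, $lastId) {
    $next = $start + $batchSize;
    if ($next > $lastId) {
        return null;
    }
    return $script . '?start=' . $next;
}

$batchSize = 100;        // assumed value; tune it to stay under the timeout
$lastId    = 1000000;    // upper bound taken from the original loop
$start     = isset($_GET['start']) ? max(1, (int) $_GET['start']) : 1;
$end       = min($start + $batchSize - 1, $lastId);

for ($i = $start; $i <= $end; $i++) {
    // fetch page $i and store its title here
}

// Hand the next batch to a fresh request; skipped when run from the CLI.
$url = next_batch_url($_SERVER['PHP_SELF'], $start, $batchSize, $lastId);
if ($url !== null && PHP_SAPI !== 'cli') {
    header('Location: ' . $url);
    exit;
}
```

Each request then handles only one batch of ids, and the chain of redirects ends on its own once the last id has been processed.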
I'm a little late, but there is a case-insensitive modifier in regular expressions. If my memory doesn't fail me, I think it's just an "i" after the closing delimiter. But I do believe the code should work as provided.
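A small sketch of what the "i" modifier changes; the sample HTML string is made up for illustration.

```php
<?php
$html = "<TITLE>Upper-case tags</TITLE>";

// Without "i", the lower-case pattern does not match upper-case tags.
$plain = preg_match('/<title>(.*?)<\/title>/', $html, $m1);

// With "i" after the closing delimiter, letter case is ignored.
$insensitive = preg_match('/<title>(.*?)<\/title>/i', $html, $m2);

var_dump($plain);        // int(0)
var_dump($insensitive);  // int(1)
echo $m2[1], "\n";       // Upper-case tags
```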