Hi! I'm making a PHP crawler/spider as a small project to learn PHP. I want to crawl this big Swedish site, http://www.adlibris.com/se/, and collect all the links to check that it works as it should. There should be about 3.5 million books, each with a unique page, so I guess the total number of links should be around 3.5-4 million.

However, my script only finds 1500 links, and I get 8-10 error messages above the links. They all say the same thing:

"Warning: Invalid argument supplied for foreach() in C:\xampp\xampp\htdocs\adlibris\hamtalankarlight.php on line 42"

Line 42 in my code is this line:

    foreach ($allinks as $value3) {

This is the code:

    <?php
    include "Snoopy.class.php";

    $snoopy = new Snoopy;
    $snoopy->fetchlinks("http://www.adlibris.com/se/");
    // print $snoopy->results;
    $links = $snoopy->results;
    //echo $links[0];//
    //print_r($links);

    $keyword = "adlibris.com/se";
    $array2 = array();
    foreach ($links as $value) {
        if (strpos($value, $keyword) > 0) {
            $array2[] = $value;
        }
    }
    //print_r($array2);

    foreach ($array2 as $value2) {
        $snoopy = new Snoopy;
        $snoopy->fetchlinks($value2);
        $allinks = $snoopy->results;
        foreach ($allinks as $value3) {
            if (in_array($value3, $array2)) {
            } else {
                $array2[] = $value3;
            }
        }
    }
    print_r($array2);
    ?>

You can find the Snoopy class that I'm using here: http://rapidshare.com/files/34320731...class.php.html

Do you have any ideas what the problem might be, or why I get these error messages?
By the way, line 42 won't be the same in the code posted here, since I removed some Swedish comments. Line 42 (the error) is, as I mentioned above, this line: "foreach ($allinks as $value3) {"
You need to do some debugging. Please post the output of print_r($allinks); this will show you how the array is constructed and what it contains. Note that foreach raises this warning when it is given something that is not an array, so for some pages $allinks may not be an array at all (for example an empty value after a failed fetch), or it may contain nested child arrays you are not expecting.

Also, you may want to change the variable names to something more appropriate; naming variables after what they contain makes debugging much easier.
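A minimal sketch of that check, assuming `$snoopy->results` can be something other than an array when a fetch fails. The helper name `mergeNewLinks` is hypothetical, and the `/a` and `/b` URLs are placeholders standing in for real fetch results:

```php
<?php
// Hypothetical helper: merge newly fetched links into the known list,
// skipping anything that is not an array instead of passing it to foreach.
function mergeNewLinks($fetched, array $known)
{
    if (!is_array($fetched)) {
        // Report what was actually returned so the failing page can be identified.
        error_log('Expected an array of links, got ' . gettype($fetched));
        return $known;
    }
    foreach ($fetched as $link) {
        if (!in_array($link, $known, true)) {
            $known[] = $link;
        }
    }
    return $known;
}

// Placeholder stand-ins for $snoopy->results: a good result, a failed fetch,
// and a result that repeats one link and adds a new one.
$known = mergeNewLinks(array('http://www.adlibris.com/se/a'), array());
$known = mergeNewLinks(null, $known);
$known = mergeNewLinks(
    array('http://www.adlibris.com/se/a', 'http://www.adlibris.com/se/b'),
    $known
);
print_r($known);
```

With a guard like this the crawl keeps going when one page fails, and the log shows which kind of value came back instead of an array.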
I also wanted to say that I'd recommend using camelCase:

    foreach ($array2 as $value2) {
        /* ... */
        foreach ($allinks as $value3) {
            if (in_array($value3, $array2)) {
            } else {
                $array2[] = $value3;
            }
        }
    }

Try opening that project again in 3 days and you'll see why "good code is its own best documentation".
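As an illustration of the naming advice, here is a sketch of the keyword filter from the original script rewritten with descriptive camelCase names. The function and variable names are hypothetical, and the `example.com` URL is a placeholder. One behavioural change is made deliberately: `strpos(...) !== false` is used instead of `> 0`, so a match at position 0 also counts.

```php
<?php
// Hypothetical rewrite of the keyword filter with descriptive camelCase names.
function filterLinksByKeyword(array $fetchedLinks, $siteKeyword)
{
    $siteLinks = array();
    foreach ($fetchedLinks as $link) {
        // !== false also accepts a match at position 0, unlike "> 0".
        if (strpos($link, $siteKeyword) !== false) {
            $siteLinks[] = $link;
        }
    }
    return $siteLinks;
}

$fetchedLinks = array(
    'http://www.adlibris.com/se/',
    'http://example.com/elsewhere', // placeholder for an off-site link
);
$siteLinks = filterLinksByKeyword($fetchedLinks, 'adlibris.com/se');
print_r($siteLinks);
```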
Sometimes the problem is not in your code. The target server may be configured to defend against intensive crawling. It may be able to detect abnormal activity from a single connection session or a single IP address; for example, it may allow only a certain number of page views within a short period of time. Then again, it could simply be a glitch in your code.
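If rate limiting does turn out to be the cause, one common mitigation is to space requests out. The following is only a sketch under assumptions: the helper name `secondsToWait` and the one-second interval are made up for illustration, and nothing is known here about what limits the site actually applies.

```php
<?php
// Hypothetical helper: how long to pause so that requests are at least
// $minInterval seconds apart. The names and the interval are assumptions.
function secondsToWait($lastRequestAt, $now, $minInterval)
{
    $elapsed = $now - $lastRequestAt;
    return max(0.0, $minInterval - $elapsed);
}

$minInterval = 1.0;    // assumed politeness delay, in seconds
$lastRequestAt = 0.0;  // no request made yet
$urlsToFetch = array('http://www.adlibris.com/se/');

foreach ($urlsToFetch as $url) {
    $wait = secondsToWait($lastRequestAt, microtime(true), $minInterval);
    if ($wait > 0) {
        usleep((int) round($wait * 1000000));
    }
    // The actual fetch (for example $snoopy->fetchlinks($url)) would go here.
    $lastRequestAt = microtime(true);
}
```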
Zandigo, while this may be the case for some websites, I imagine they are a minute fraction. I ran a blog search engine that used a web crawler I created; unfortunately I cannot release the source (even if I knew where it was!) because I sold the rights along with the project. Supercharge, later today I'll knock up a small script to mimic what you need.