PHP crawler/spider, almost done, have some questions!

Discussion in 'PHP' started by supercharge, Jan 29, 2010.

  1. #1
    Hi!

    I'm making a PHP crawler/spider as a small project to learn PHP. I want to crawl this big Swedish site: http://www.adlibris.com/se/ and get all the links, to check that it works as it should.

    There should be about 3.5 million books, each of which should have a unique page, so I guess the total number of links should be 3.5-4 million or so.

    However, my script only finds 1500 links and I get 8-10 error messages above the links. They all say the same thing: "Warning: Invalid argument supplied for foreach() in C:\xampp\xampp\htdocs\adlibris\hamtalankarlight.php on line 42"

    Line 42 in my code is this line: " foreach ($allinks as $value3) {"

    This is the code:
    <?php
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchlinks("http://www.adlibris.com/se/");
    // print $snoopy->results;

    $links = $snoopy->results;
    // echo $links[0];
    // print_r($links);

    $keyword = "adlibris.com/se";
    $array2 = array();

    foreach ($links as $value) {
        if (strpos($value, $keyword) > 0) {
            $array2[] = $value;
        }
    }
    // print_r($array2);

    foreach ($array2 as $value2) {
        $snoopy = new Snoopy;
        $snoopy->fetchlinks($value2);
        $allinks = $snoopy->results;

        foreach ($allinks as $value3) {
            if (in_array($value3, $array2)) {

            } else {
                $array2[] = $value3;
            }
        }
    }

    print_r($array2);

    ?>
    PHP:
    You can find the Snoopy class that I'm using here: http://rapidshare.com/files/34320731...class.php.html

    Do you have any idea what the problem might be, or why I get these error messages?
     
    supercharge, Jan 29, 2010 IP
  2. supercharge

    #2
    By the way, line 42 won't be the same in the code I posted, since I removed some Swedish comments.

    Line 42 (the error) is, as I mentioned above, this line: " foreach ($allinks as $value3) {"
     
    supercharge, Jan 29, 2010 IP
  3. BRUm

    #3
    You need to do some debugging. Please post the results from: print_r($allinks);

    This will show you how the array is constructed and what it contains.

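    That warning means $allinks is not an array at that point. foreach handles empty and nested arrays without complaint, but not null, false, or a string, which may be what you get back when a fetch fails or a page has no links.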

    Also, you may want to change the variable names to something more descriptive; names that say what a variable contains make debugging much easier.
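
    A minimal sketch of the kind of check that avoids the foreach() warning, assuming PHP 7 or later. The helper name normaliseLinks() and the variable names are hypothetical and not part of Snoopy; the idea is only to make sure foreach always receives an array.

    ```php
    <?php
    // Hypothetical helper: turn whatever a link fetcher returned into an array.
    function normaliseLinks($fetchResult): array
    {
        // foreach only accepts arrays (or Traversable objects), so anything
        // else (null, false, a string) is treated as "no links found".
        return is_array($fetchResult) ? $fetchResult : [];
    }

    // Simulate a fetch that returned nothing usable.
    $pageLinks = normaliseLinks(null);

    foreach ($pageLinks as $link) {
        echo $link, "\n"; // not reached here, because the list is empty
    }

    echo count($pageLinks), " links found\n";
    ```

    In the original script the same check would go around $snoopy->results before the inner loop, so that a page without usable results is skipped instead of raising a warning.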
     
    BRUm, Jan 30, 2010 IP
  4. Gray Fox

    #4
    I wanted to say the same thing. I'd also recommend using camelCase:

    
    foreach ($array2 as $value2) {
        /* ... */
        foreach ($allinks as $value3) {
            if (in_array($value3, $array2)) {
            } else {
                $array2[] = $value3;
            }
        }
    }
    
    PHP:
    Try opening that project again in three days and you'll see why "good code is its own best documentation".
     
    Gray Fox, Jan 30, 2010 IP
  5. zandigo

    #5
    Sometimes the problem isn't your code. The target server may be configured to defend against intensive crawling. It may be able to detect abnormal activity from a single connection session or a single IP address; for example, it may allow only a certain number of page views within a short period of time.

    Then again, it could simply be a bug in your code.
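
    If rate limiting does turn out to be the issue, one common mitigation is to pause between requests. This is only a sketch, assuming PHP 7 or later: crawlPolitely(), $fakeFetch and the half-second default delay are made up for illustration, and a real fetcher (for example one built on Snoopy) would replace the stand-in.

    ```php
    <?php
    // Hypothetical helper: fetch each URL in turn, pausing between requests.
    // $fetch is any callable that takes a URL and returns that page's links.
    function crawlPolitely(array $urls, callable $fetch, int $delayMicroseconds = 500000): array
    {
        $results = [];
        $isFirst = true;
        foreach ($urls as $url) {
            if (!$isFirst) {
                usleep($delayMicroseconds); // pause before every request except the first
            }
            $isFirst = false;
            $results[$url] = $fetch($url);
        }
        return $results;
    }

    // Stand-in fetcher so the sketch runs without touching a real server.
    $fakeFetch = function (string $url): array {
        return [$url . 'next'];
    };

    $pages = crawlPolitely(['http://example.com/a/', 'http://example.com/b/'], $fakeFetch, 100000);
    print_r($pages);
    ```

    A suitable delay depends entirely on the target site, so the value here is a placeholder.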
     
    zandigo, Jan 30, 2010 IP
  6. BRUm

    #6
    Zandigo, while this may be the case for some websites, I imagine they are a tiny fraction. I ran a blog search engine that used a web crawler I created. Unfortunately I can't release the source (even if I knew where it was!) because I sold the rights along with the project.

    Supercharge, later on today I'll knock up a small script to mimic what you need.
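
    In the meantime, here is a minimal sketch of one way such a crawler could be structured, assuming PHP 7 or later. It is not the script mentioned above: crawl(), $fetchLinks, $maxPages and the example.com URLs are all hypothetical. The point it illustrates is that a foreach by value walks a snapshot of the array, so links appended to $array2 inside the loop are never visited themselves. That is a likely reason the original script stops after one level of links. An explicit queue avoids this.

    ```php
    <?php
    // Hypothetical sketch: breadth-first crawl with an explicit queue.
    // $fetchLinks is any callable that takes a URL and returns its links,
    // or a non-array value when the fetch fails.
    function crawl(string $startUrl, string $keyword, callable $fetchLinks, int $maxPages = 100): array
    {
        $queue   = [$startUrl];         // URLs waiting to be fetched
        $seen    = [$startUrl => true]; // URLs already queued, keyed for fast lookup
        $fetched = 0;

        while ($queue !== [] && $fetched < $maxPages) {
            $url = array_shift($queue);
            $fetched++;

            $links = $fetchLinks($url);
            if (!is_array($links)) {
                continue;               // failed fetch or page without links
            }

            foreach ($links as $link) {
                // strpos() returns 0 for a match at the very start,
                // so compare against false rather than using "> 0".
                if (strpos($link, $keyword) === false || isset($seen[$link])) {
                    continue;
                }
                $seen[$link] = true;
                $queue[] = $link;       // newly found links get visited later
            }
        }

        return array_keys($seen);
    }

    // Stand-in for a real fetcher, so the sketch runs without network access.
    $fakeSite = [
        'http://example.com/se/'  => ['http://example.com/se/a', 'http://other.example/x'],
        'http://example.com/se/a' => ['http://example.com/se/b', 'http://example.com/se/'],
        'http://example.com/se/b' => null,
    ];
    $fetchLinks = function (string $url) use ($fakeSite) {
        return $fakeSite[$url] ?? null;
    };

    $found = crawl('http://example.com/se/', 'example.com/se', $fetchLinks);
    print_r($found);
    ```

    With a real site, $fetchLinks would wrap the actual fetching code, and $maxPages keeps a test run from wandering across millions of pages.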
     
    BRUm, Jan 31, 2010 IP
  7. supercharge

    #7
    That would be great :)
     
    supercharge, Feb 1, 2010 IP