PHP crawler/spider, almost done, have some questions!

Discussion in 'PHP' started by supercharge, Jan 29, 2010.

  1. #1
    Hi!

    I'm making a PHP crawler/spider as a small project to learn PHP. I want to crawl this big Swedish site: http://www.adlibris.com/se/ and get all the links, to see that it works as it should.

    There should be about 3.5 million books, each of which should have a unique page, so I guess the total number of links should be 3.5-4 million or something like that.

    However, my script only finds 1500 links and I get 8-10 error messages above the links. They all say the same thing: "Warning: Invalid argument supplied for foreach() in C:\xampp\xampp\htdocs\adlibris\hamtalankarlight.php on line 42"

    Line 42 in my code is this line: " foreach ($allinks as $value3) {"

    This is the code:
    <?php
    include "Snoopy.class.php";

    // Fetch every link on the start page.
    $snoopy = new Snoopy;
    $snoopy->fetchlinks("http://www.adlibris.com/se/");
    $links = $snoopy->results;
    //print_r($links);

    // Keep only the links that point back into the site.
    $keyword = "adlibris.com/se";
    $array2 = array();

    foreach ($links as $value) {
        if (strpos($value, $keyword) > 0) {
            $array2[] = $value;
        }
    }
    //print_r($array2);

    // Fetch each collected page and add any links not seen before.
    foreach ($array2 as $value2) {
        $snoopy = new Snoopy;
        $snoopy->fetchlinks($value2);
        $allinks = $snoopy->results;

        foreach ($allinks as $value3) {
            if (!in_array($value3, $array2)) {
                $array2[] = $value3;
            }
        }
    }

    print_r($array2);
    ?>
    PHP:
    You can find the Snoopy class that I'm using here: http://rapidshare.com/files/34320731...class.php.html

    Do you have any ideas what the problem might be or why I get these error messages?
     
    supercharge, Jan 29, 2010 IP
  2. supercharge

    #2
    Btw, line 42 won't be the same in the code posted, since I removed some Swedish comments.

    Line 42 (the error) is, as I mentioned above, this line: " foreach ($allinks as $value3) {"
     
    supercharge, Jan 29, 2010 IP
  3. BRUm

    #3
    You need to do some debugging. Please post the results from: print_r($allinks);

    This will show you how the array is constructed and what it contains.

    The array could be empty, or not an array at all (for example false or an empty string after a failed fetch), and it is the latter that makes the foreach loop fail with this warning.

    Also, you may want to rename the variables: names that describe what the variables contain make debugging much easier.
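
    Since the warning is raised whenever foreach is handed something that is not an array, a guard along these lines can show which pages are responsible. This is only a rough sketch: linksOrEmpty() is a made-up helper name, and the two values passed in are stand-ins for whatever $snoopy->results holds after a good fetch and after a failed one (I am assuming it is not an array in the failed case).

    ```php
    <?php
    // Hypothetical helper: returns the links if $results is an array, otherwise
    // logs what was actually returned and falls back to an empty list.
    function linksOrEmpty($results, $url)
    {
        if (is_array($results)) {
            return $results;
        }
        // var_export() shows the real value, e.g. '' or false or NULL.
        error_log("No link array for $url, got: " . var_export($results, true));
        return array();
    }

    // Stand-ins for $snoopy->results after a good fetch and a failed fetch.
    $good = linksOrEmpty(array('http://www.adlibris.com/se/'), 'page-a');
    $bad  = linksOrEmpty('', 'page-b');

    foreach ($bad as $link) {
        echo $link, "\n"; // never reached, and no warning is raised
    }
    ```

    In the crawler this would wrap $snoopy->results right before the inner foreach, so the log tells you which URLs came back without links.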
     
    BRUm, Jan 30, 2010 IP
  4. Gray Fox

    #4
    I wanted to say that as well. I'd also recommend using camelCase.

    
    foreach ($array2 as $value2) {
        /* ... */
        foreach ($allinks as $value3) {
            if (in_array($value3, $array2)) {
            } else {
                $array2[] = $value3;
            }
        }
    }
    
    PHP:
    Try opening that project in 3 days and you'll see why "good code is its own best documentation".
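
    To give an idea of what descriptive camelCase names could look like, here is a rough sketch of the same crawl loop. Everything in it is illustrative: crawlSite(), $fetchLinks, $maxPages and the example.com URLs are made up, and $fetchLinks stands for any function that takes a URL and returns its links (it could wrap Snoopy). It also uses an explicit queue, because as far as I know a by-value foreach does not visit elements appended to the array while it is running, which would limit the crawl to two levels.

    ```php
    <?php
    // Sketch only: $fetchLinks is a hypothetical callable (URL in, links out).
    function crawlSite($startUrl, $keyword, $fetchLinks, $maxPages)
    {
        $pendingUrls  = array($startUrl);         // URLs still to fetch
        $seenUrls     = array($startUrl => true); // keys give fast duplicate checks
        $pagesFetched = 0;

        while ($pendingUrls && $pagesFetched < $maxPages) {
            $currentUrl = array_shift($pendingUrls);
            $pageLinks  = $fetchLinks($currentUrl);
            $pagesFetched++;

            if (!is_array($pageLinks)) {
                continue; // failed fetch: nothing to loop over
            }
            foreach ($pageLinks as $pageLink) {
                $isOnSite = strpos($pageLink, $keyword) !== false;
                if ($isOnSite && !isset($seenUrls[$pageLink])) {
                    $seenUrls[$pageLink] = true;
                    $pendingUrls[] = $pageLink;
                }
            }
        }
        return array_keys($seenUrls);
    }

    // Tiny in-memory "site" so the sketch runs without network access.
    $fakeSite = array(
        'http://example.com/se/'  => array('http://example.com/se/a', 'http://other.test/'),
        'http://example.com/se/a' => array('http://example.com/se/b', 'http://example.com/se/'),
        'http://example.com/se/b' => '', // simulates a failed fetch
    );
    $fetchFromFakeSite = function ($url) use ($fakeSite) {
        return isset($fakeSite[$url]) ? $fakeSite[$url] : '';
    };

    $foundUrls = crawlSite('http://example.com/se/', 'example.com/se', $fetchFromFakeSite, 100);
    print_r($foundUrls);
    ```

    The $maxPages cap is there so a test run against a site with millions of pages stops after a known number of requests.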
     
    Gray Fox, Jan 30, 2010 IP
  5. zandigo

    #5
    Sometimes the problem is not in your code. The target server may be configured to defend against intensive crawling. It may be able to detect abnormal activity from a single connection session or from a single IP. Say, it may allow only a specific number of page views within a short period of time.

    But then again, it could be entirely a glitch in your code.
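
    If the server does limit page views per time window, pausing between requests is one thing to try. A minimal sketch, with assumptions: fetchAllWithDelay() and $fetchPage are made-up names ($fetchPage is any function that takes a URL and returns the result), and the 0.2 second delay is an arbitrary value, not something I know about this particular site.

    ```php
    <?php
    // Sketch only: $fetchPage is a hypothetical callable (URL in, result out).
    function fetchAllWithDelay(array $urls, $fetchPage, $delaySeconds)
    {
        $resultsByUrl = array();
        $isFirstRequest = true;
        foreach ($urls as $url) {
            if (!$isFirstRequest && $delaySeconds > 0) {
                // Pause between requests; usleep() takes microseconds.
                usleep((int) ($delaySeconds * 1000000));
            }
            $isFirstRequest = false;
            $resultsByUrl[$url] = $fetchPage($url);
        }
        return $resultsByUrl;
    }

    // Fake fetcher so the sketch runs without touching a real server.
    $fakeFetch = function ($url) {
        return "body of $url";
    };
    $pages = fetchAllWithDelay(
        array('http://example.com/1', 'http://example.com/2'),
        $fakeFetch,
        0.2
    );
    echo count($pages), " pages fetched\n";
    ```

    If the warnings disappear with a delay in place, that points at the server rather than at the code.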
     
    zandigo, Jan 30, 2010 IP
  6. BRUm

    #6
    Zandigo, while this may be the case for some websites, I imagine they are a minute fraction. I ran a blog search engine which used a web crawler I created. Unfortunately I cannot release the source (even if I knew where it was!) because I sold the rights along with the project.

    Supercharge, later on today I'll knock up a small script to mimic what you need.
     
    BRUm, Jan 31, 2010 IP
  7. supercharge

    #7
    That would be great :)
     
    supercharge, Feb 1, 2010 IP