Scraping information

Discussion in 'Programming' started by The Saint, Nov 3, 2006.

  1. #1
    I'm simply trying to collect information off a website and put it into an Excel sheet or something simple and organized.

    Any suggestions?
     
    The Saint, Nov 3, 2006
  2. nico_swd

    nico_swd Prominent Member

    #2
    These might help:
    phpclasses.org/browse/package/86.html
    web-aware.com/biff/index.htm
     
    nico_swd, Nov 3, 2006
  3. streety

    streety Peon

    #3
    A couple of good finds there for the Excel side of things. As for grabbing the content in the first place, once you have the page downloaded I would just use a regular expression. In PHP I would use file_get_contents to download the page and then preg_match_all to run the regular expression against it.
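
    A minimal sketch of that approach, using an inline HTML string in place of a downloaded page so it runs offline. The sample markup and the variable names are made up for illustration; with a real site you would fill $page with file_get_contents($url) and adapt the pattern to the page's actual source.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
// With a real site: $page = file_get_contents($url);
$page = '<ul><li>First item</li><li>Second item</li></ul>';

// Non-greedy capture of everything between <li> and </li>.
// The "s" modifier lets "." match newlines as well.
preg_match_all('/<li>(.*?)<\/li>/s', $page, $matches);

// $matches[0] holds the full matches (tags included),
// $matches[1] holds the text captured by the first group.
$items = $matches[1];

foreach ($items as $item) {
    print $item . "\n";
}
```

    The thing to watch is the layout of $matches: it is indexed by capture group first and by match second, so looping over $matches itself walks the groups, not the individual matches.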
     
    streety, Nov 3, 2006
  4. The Saint

    The Saint Peon

    #4
    Thanks. I've done scraping with WordPress plugins and such, but I've never made something by hand in PHP because I'm a novice.
     
    The Saint, Nov 3, 2006
  5. streety

    streety Peon

    #5
    Hopefully this should get you going.

    <?php

    // Download the listing page.
    $page = file_get_contents("http://www.text-link-ads.com/Autos-C45/");
    if ($page === false) {
        die("Could not download the page");
    }

    // Each ad entry sits in its own table with exactly these attributes.
    $regex = '/<table\swidth="100%"\sborder="0"\scellpadding="0"\scellspacing="0">(.*?)<\/table>/s';

    preg_match_all($regex, $page, $table);

    foreach ($table[0] as $entry) {

        // $entry_tokens[0] holds the full <tr>...</tr> matches,
        // $entry_tokens[1] holds just the contents of each row.
        preg_match_all('/<tr>(.*?)<\/tr>/s', $entry, $entry_tokens);
        $rows = $entry_tokens[1];

        // Expected row order: description, number of pages the ad will
        // appear on, spots available/filled, cost.
        foreach (array_slice($rows, 0, 4) as $row) {
            print(trim(strip_tags($row)));
            print "<br><br>";
        }

        print "<br><br>";
    }

    ?>
    It isn't perfect, but it shows you basically what needs to be done. You'll need to study the source code of the Text Link Ads page and then troubleshoot with var_dump until you are extracting all the info you need.
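
    Since the goal was something Excel can open, here is a rough sketch of the last step: writing the extracted values to a CSV file. The $entries data, the column names, and the ads.csv filename are placeholders made up for illustration; in practice each row would be built from the cells pulled out with preg_match_all.

```php
<?php
// Placeholder data (made up): in practice each row would hold the
// strip_tags()'d cells extracted from one entry on the page.
$entries = [
    ['Example car blog', '120', '3/10', '$25.00'],
    ['Example "auto" news, reviews', '45', '1/5', '$40.00'],
];

// Hypothetical output path; the system temp dir keeps the sketch self-contained.
$csvPath = sys_get_temp_dir() . '/ads.csv';

$handle = fopen($csvPath, 'w');
if ($handle === false) {
    die("Could not open $csvPath for writing\n");
}

// Header row, then one line per entry. fputcsv takes care of quoting
// fields that contain commas, spaces, or double quotes.
fputcsv($handle, ['Description', 'Pages', 'Spots', 'Cost'], ',', '"', '\\');
foreach ($entries as $entry) {
    fputcsv($handle, $entry, ',', '"', '\\');
}
fclose($handle);

print "Wrote " . count($entries) . " rows to $csvPath\n";
```

    CSV is used here because it needs no extra library and Excel opens it directly; a real .xls writer like the classes linked earlier in the thread would be the next step if you need formatting.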

    If you run into any problems, let me know and I'll try to help.
     
    streety, Nov 3, 2006