Open Source Scraper

Discussion in 'PHP' started by unispace, Sep 10, 2007.

  1. #1
    I'm looking for an open source spider.

    I want to spider some government online databases and need a way to grab their data from HTML pages and insert it systematically into a MySQL database.

    The pages are all laid out the same way; it's just the data that changes.

    Has anyone done a project like this or know of anything that could be customized to get the job done?

    TIA
     
    unispace, Sep 10, 2007 IP
  2. omgitsfletch

    omgitsfletch Well-Known Member

    Messages:
    1,222
    Likes Received:
    44
    Best Answers:
    0
    Trophy Points:
    145
    #2
    You are going to have to write some regular expressions that will be able to split the data from the crap surrounding it. Regular expressions are NOT an easy thing to learn, so your best bet is probably posting in the marketplace to find a programmer to help you.
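    For example, here's a quick sketch (the markup is made up, so take it as an illustration only) of pulling one field out of a row with preg_match():

    ```php
    <?php
    // Hypothetical markup: each record's name sits in a <td class="name"> cell.
    $html = '<tr><td class="name">Acme Corp</td><td class="phone">555-0100</td></tr>';

    // Capture the text between the opening and closing tags (non-greedy).
    if (preg_match('/<td class="name">(.*?)<\/td>/', $html, $m)) {
        echo $m[1]; // Acme Corp
    }
    ?>
    ```

    Every site's markup is different, so the pattern has to be rewritten for each layout you scrape.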
     
    omgitsfletch, Sep 10, 2007 IP
  3. ErectADirectory

    ErectADirectory Guest

    Messages:
    656
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #3
    I hate to quote myself but I posted a simple scraper a few weeks ago. PM if you need me to set this up to dump data into your db.

    This is a very simple machine. I'm sure someone can point you to a very elegant solution but I'm not sure a "one size fits all" scraper is available.

    Good luck
     
    ErectADirectory, Sep 10, 2007 IP
    Such Great Heights likes this.
  4. unispace

    unispace Peon

    Messages:
    85
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #4
    This is perfect, thanks E. You saved me at least a couple of hours.

    I haven't done much scraping before but I've done quite a bit of parsing files so this will do just fine.

    Obviously I will have to write some code to insert into a db but nothing too difficult.
     
    unispace, Sep 10, 2007 IP
  5. mrspeed

    mrspeed Guest

    Messages:
    193
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Just curious but what's the advantage to using curl versus fopen() ?
     
    mrspeed, Sep 10, 2007 IP
  6. stats

    stats Well-Known Member

    Messages:
    586
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    110
    #6
    I have just completed such a project to scrape pages and systematically insert the info into my DB .. such as product price goes to my price column, weight goes to weight column, description goes to description column, etc

    each website requires its own unique way of doing it .. the most painful part is deciding on the beginning and ending of the content you want .. it's not that easy sometimes and even harder if the source site is not well-organized ..

    I have a universal function that gets you the content when you give it the beginning and ending texts. It is not based on regex (because if it was - then you'd get a triple headache ..)
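    the idea is roughly like this (just a sketch of the approach with strpos()/substr(), not the exact function) ..

    ```php
    <?php
    // Return the substring between $start and $end, or FALSE if either marker is missing.
    function between($haystack, $start, $end)
    {
        $s = strpos($haystack, $start);
        if ($s === false) return false;
        $s += strlen($start);
        $e = strpos($haystack, $end, $s);
        if ($e === false) return false;
        return substr($haystack, $s, $e - $s);
    }

    $page  = '<b>Price:</b> $19.95<br><b>Weight:</b> 2 lbs';
    $price = between($page, '<b>Price:</b> ', '<br>');
    echo $price; // $19.95
    ?>
    ```

    you just feed it a different pair of markers for each column you want to fill ..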
     
    stats, Sep 10, 2007 IP
  7. ErectADirectory

    ErectADirectory Guest

    Messages:
    656
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Several things come to mind, and I'm far from a cURL expert. Only the 1st has any bearing here.

    1. fopen() is disabled on many servers
    2. cURL allows you to spoof the useragent (ever want to surf as googlebot?)
    3. cURL allows you to simulate a $_POSTed form
    4. cURL will read and follow redirects (301, etc)
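
    A rough sketch of points 2-4 with the PHP cURL extension (the URL and form fields are invented):

    ```php
    <?php
    // Fetch a page while spoofing the user agent, POSTing a form,
    // and following any redirects. URL and fields are made up for the example.
    $ch = curl_init('http://example.com/search.php');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1'); // point 2: spoof the user agent
    curl_setopt($ch, CURLOPT_POST, true);                 // point 3: simulate a POSTed form
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=widgets&page=1');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // point 4: follow 301/302 redirects
    $html = curl_exec($ch);
    curl_close($ch);
    ?>
    ```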

    I barely skim the surface of what all is possible with cURL but I'm sure these guys will give you a better idea of what all can happen.

    I also hear it's faster than fopen or file_get_contents though I have never benchmarked it. Anyone else have some experience with it they'd like to share?
     
    ErectADirectory, Sep 10, 2007 IP
  8. mrspeed

    mrspeed Guest

    Messages:
    193
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Thanks for the reply.

    A few years ago I wrote a scraper in ASP. Writing the regex or pattern matches can be a real pain. I seem to remember removing carriage returns and line feeds before trying to parse it.

    The code above may come in handy for me. I have a large classic ASP site that I need to convert to a CMS system and I hate the thought of doing it by hand. At least I have a sitemap for the site so I really don't need to crawl.
     
    mrspeed, Sep 10, 2007 IP
  9. stats

    stats Well-Known Member

    Messages:
    586
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    110
    #9
    you're on the right track ..
    your best bet would be to start with the following ..

    1. $page = file_get_contents('http://sourcesite.com'); // get content of source page into a variable
    2. $page = preg_replace('/\s/', ' ', $page); // replace every whitespace character with a plain space " " so that you don't kill yourself on finding the tabs and returns

    the rest is a unique work for each website

    good luck on it .. i was on same job a couple weeks ago, and know that .. but once you get it done - you feel proud :)


     
    stats, Sep 11, 2007 IP
  10. tripy

    tripy Guest

    Messages:
    32
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #10
    If you are on PHP, and especially PHP 5, you could use the DOM functions to get the structure of the page like an XML file and navigate in it.
    This would be much easier than regexps...
    
    <?php
    $html =<<<EOT
    <ul>
      <ul id="section1">
       <li name="param1">value1</li>
       <li name="param2">value2</li>
      </ul>
      <ul id="section2">
       <li name="param3">value3</li>
      </ul>
    </ul>
    EOT;
    
    $dom = new DomDocument;
    $dom->preserveWhiteSpace = FALSE;
    $dom->loadHTML($html);
    $aryLi = $dom->getElementsByTagName('li');
    
    foreach ($aryLi as $li) {
        echo $li->getAttribute('name').'<br>';
    }
    ?>
    
    PHP:
    Expected result:
    --------------
    param1
    param2
    param3
     
    tripy, Sep 11, 2007 IP
    ErectADirectory likes this.
  11. ErectADirectory

    ErectADirectory Guest

    Messages:
    656
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Tripy,

    Great post, I'll give this a try next time I need to scrape. It seems like a great solution when a document is well formatted. I'm afraid you would still need a regex (or explode() in this example) for most scraping needs as, for example, most addresses are laid out like ...

    <address>Company Name<br/>
    123 Mystreet Ave.<br/>
    Oakland, CA 12345</address>
    HTML:
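    e.g. something like this (an untested sketch; it assumes the line breaks are <br/> tags, so adjust to whatever the real markup uses) for the address above:

    ```php
    <?php
    // Sketch: turn the <br/> breaks into newlines, strip the tags,
    // then explode() the address block into its three lines.
    $address = '<address>Company Name<br/>123 Mystreet Ave.<br/>Oakland, CA 12345</address>';
    $address = strip_tags(str_replace('<br/>', "\n", $address));
    $lines   = explode("\n", $address);
    print_r($lines); // [0] => Company Name, [1] => 123 Mystreet Ave., [2] => Oakland, CA 12345
    ?>
    ```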
    A question for you (so I don't have to dig around php.net for the answer ... lazy me).

    Using your above example, is there a simple way to use the dom functions to get an expected result of

    value1
    value2
    value3
     
    ErectADirectory, Sep 11, 2007 IP
  12. tripy

    tripy Guest

    Messages:
    32
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #12
    As a side note, for your not-so-well-formatted content there is a function normalizeDocument() that you could test.
    Don't know how it would react though...
    php.net/manual/en/function.dom-domdocument-normalizedocument.php

    And as for the elements value:
    
    echo $li->nodeValue;
    
    PHP:
     
    tripy, Sep 11, 2007 IP