I'm looking for an open source spider. I want to spider some government online databases and need a way to grab their data from the HTML pages and insert it systematically into a MySQL database. The pages are all laid out the same way; it's just the data that changes. Has anyone done a project like this, or know of anything that could be customized to get the job done? TIA
You are going to have to write some regular expressions that can split the data from the crap surrounding it. Regular expressions are NOT an easy thing to learn, so your best bet is probably posting in the marketplace to find a programmer to help you.
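Just to give a rough idea of what's involved, a minimal sketch might look like this. The URL and the <td class="name"> pattern are made up; the real pattern depends entirely on how the government pages are marked up.

<?php
// Rough sketch: pull every value out of a repeated chunk of markup.
// The URL and the <td class="name"> pattern are placeholders -- adjust
// both to match the actual source pages.
$html = file_get_contents('http://example.gov/records.html');

// The ? makes the match non-greedy; the /s modifier lets . span newlines.
preg_match_all('/<td class="name">(.*?)<\/td>/s', $html, $matches);

foreach ($matches[1] as $name) {
    echo trim(strip_tags($name)) . "\n";
}
?>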
I hate to quote myself, but I posted a simple scraper a few weeks ago. PM me if you need me to set it up to dump data into your DB. It's a very simple machine; I'm sure someone can point you to a more elegant solution, but I'm not sure a "one size fits all" scraper exists. Good luck.
This is perfect, thanks E. You saved me at least a couple of hours. I haven't done much scraping before, but I've done quite a bit of file parsing, so this will do just fine. Obviously I'll have to write some code to insert into a DB, but nothing too difficult.
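Something like this rough sketch is probably all the insert step takes. The DSN, login, and the records table with its name/value columns are all made-up placeholders.

<?php
// Rough sketch of the insert step. The DSN, credentials, and the
// records table with its columns are all placeholders.
$db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO records (name, value) VALUES (?, ?)');

// $rows would come from the parsing step, e.g. array(array('foo', 'bar'), ...)
foreach ($rows as $row) {
    $stmt->execute(array($row[0], $row[1]));
}
?>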
I have just completed such a project: scraping pages and systematically inserting the info into my DB, so that the product price goes to my price column, the weight goes to the weight column, the description goes to the description column, etc. Each website requires its own unique way of doing it. The most painful part is deciding on the beginning and ending of the content you want; it's not that easy sometimes, and even harder if the source site is not well organized. I have a universal function that gets you the content when you give it the beginning and ending texts. It is not based on regex (because if it was, you'd get a triple headache).
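The function itself isn't posted here, but a strpos()/substr() version of the same idea might look roughly like this; the marker strings in the example are just for illustration.

<?php
// A guess at a "grab whatever sits between two markers" helper,
// built on strpos()/substr() instead of regex.
function grab_between($haystack, $begin, $end)
{
    $start = strpos($haystack, $begin);
    if ($start === false) {
        return false;                 // begin marker not found
    }
    $start += strlen($begin);         // skip past the begin marker itself

    $stop = strpos($haystack, $end, $start);
    if ($stop === false) {
        return false;                 // end marker not found
    }
    return substr($haystack, $start, $stop - $start);
}

// Example: pull a price out of a known chunk of markup.
$page = '<span class="price">$19.95</span>';
echo grab_between($page, '<span class="price">', '</span>');  // $19.95
?>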
Several things come to mind, and I'm far from a cURL expert. Only the first has any bearing here.

1. fopen() is disabled on many servers.
2. cURL lets you spoof the useragent (ever want to surf as Googlebot?).
3. cURL lets you simulate a $_POSTed form.
4. cURL will read and follow redirects (301, etc.).

I barely skim the surface of what's possible with cURL, but I'm sure these guys will give you a better idea of what can happen. I also hear it's faster than fopen() or file_get_contents(), though I have never benchmarked it. Anyone else have some experience with it they'd like to share? A rough sketch covering points 2 through 4 follows.
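Something along these lines covers points 2 through 4; the URL and form fields are placeholders.

<?php
// Sketch covering points 2-4: spoofed useragent, a simulated POST,
// and automatic redirect following. URL and form fields are placeholders.
$ch = curl_init('http://example.com/search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string
curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
curl_setopt($ch, CURLOPT_POST, true);             // simulate a $_POSTed form
curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=widgets&page=1');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow 301/302 redirects

$page = curl_exec($ch);
curl_close($ch);
?>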
Thanks for the reply. A few years ago I wrote a scraper in ASP, and writing the regex or pattern matches can be a real pain. I seem to remember removing carriage returns and line feeds before trying to parse anything. The code above may come in handy for me. I have a large classic ASP site that I need to convert to a CMS, and I hate the thought of doing it by hand. At least I have a sitemap for the site, so I really don't need to crawl.
You're on the right track. Your best bet would be to start with the following:

1. $page = file_get_contents('http://sourcesite.com'); // get the content of the source page into a variable
2. $page = preg_replace('/\s+/', ' ', $page); // collapse every run of whitespace into a single space " " so you don't kill yourself finding tabs and returns

The rest is unique work for each website. Good luck with it; I was on the same job a couple of weeks ago, so I know. But once you get it done, you feel proud.
If you are on PHP, and especially with PHP 5, you could use the DOM functions to get the structure of the page like an XML file and navigate it. This is much easier than regexps...

<?php
$html =<<<EOT
<ul>
  <ul id="section1">
    <li name="param1">value1</li>
    <li name="param2">value2</li>
  </ul>
  <ul id="section2">
    <li name="param3">value3</li>
  </ul>
</ul>
EOT;

$dom = new DomDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($html);

$aryLi = $dom->getElementsByTagName('li');
foreach ($aryLi as $li) {
    echo $li->getAttribute('name') . '<br>';
}
?>

Expected result:
--------------
param1
param2
param3
Tripy, great post, I'll give this a try next time I need to scrape. It seems like a great solution when a document is well formatted. I'm afraid you would still need a regex (or explode() in this example) for most scraping needs, as, for example, most addresses are laid out like:

<address>Company Name<br>
123 Mystreet Ave.<br>
Oakland, CA 12345</address>

A question for you (so I don't have to dig around php.net for the answer ... lazy me). Using your above example, is there a simple way to use the DOM functions to get an expected result of:

value1
value2
value3
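For that address layout, a quick explode() sketch (just one way to do it) might be:

<?php
// Quick sketch of the explode() approach for the address layout above.
$html = '<address>Company Name<br>123 Mystreet Ave.<br>Oakland, CA 12345</address>';

$inner = strip_tags($html, '<br>');                  // keep only the <br> separators
$lines = array_map('trim', explode('<br>', $inner)); // split on them and tidy up

echo $lines[0];  // Company Name
echo $lines[1];  // 123 Mystreet Ave.
echo $lines[2];  // Oakland, CA 12345
?>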
As a side note, for your not-well-formatted content there is a function normalizeDocument() you could test. Don't know how it would react, though...
php.net/manual/en/function.dom-domdocument-normalizedocument.php

And as for the element values:

echo $li->nodeValue;
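Plugged back into the loop from the earlier post, that would be roughly:

<?php
// Same loop as above, echoing the element text instead of the attribute.
foreach ($aryLi as $li) {
    echo $li->nodeValue . '<br>';   // value1, value2, value3
}
?>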