I'm looking for an open source spider. I want to spider some government online databases and need a way to grab their data from the HTML pages and insert it systematically into a MySQL database. The pages are all laid out the same way; it's just the data that changes. Has anyone done a project like this, or know of anything that could be customized to get the job done? TIA
You are going to have to write some regular expressions that can split the data from the crap surrounding it. Regular expressions are NOT an easy thing to learn, so your best bet is probably posting in the marketplace to find a programmer to help you.
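Just to give a rough idea of what's involved, a minimal sketch might look like this. The URL and the <td class="name"> pattern are made up; the real pattern depends entirely on how the government pages are marked up.

<?php
// Rough sketch: pull every value out of a repeated chunk of markup.
// The URL and the <td class="name"> pattern are placeholders -- adjust
// both to match the actual source pages.
$html = file_get_contents('http://example.gov/records.html');

// The ? makes the match non-greedy; the /s modifier lets . span newlines.
preg_match_all('/<td class="name">(.*?)<\/td>/s', $html, $matches);

foreach ($matches[1] as $name) {
    echo trim(strip_tags($name)) . "\n";
}
?>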
I hate to quote myself, but I posted a simple scraper a few weeks ago. PM me if you need me to set it up to dump data into your DB. It's a very simple machine; I'm sure someone can point you to a more elegant solution, but I'm not sure a "one size fits all" scraper exists. Good luck.
This is perfect, thanks E. You saved me at least a couple of hours. I haven't done much scraping before, but I've done quite a bit of file parsing, so this will do just fine. Obviously I'll have to write some code to insert into a DB, but nothing too difficult.
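Something like this rough sketch is probably all the insert step takes. The DSN, login, and the records table with its name/value columns are all made-up placeholders.

<?php
// Rough sketch of the insert step. The DSN, credentials, and the
// records table with its columns are all placeholders.
$db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO records (name, value) VALUES (?, ?)');

// $rows would come from the parsing step, e.g. array(array('foo', 'bar'), ...)
foreach ($rows as $row) {
    $stmt->execute(array($row[0], $row[1]));
}
?>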
I have just completed such a project: scraping pages and systematically inserting the info into my DB, so that the product price goes to my price column, the weight goes to the weight column, the description goes to the description column, etc. Each website requires its own unique way of doing it. The most painful part is deciding on the beginning and ending of the content you want; it's not that easy sometimes, and even harder if the source site is not well organized. I have a universal function that gets you the content when you give it the beginning and ending texts. It is not based on regex (because if it was, you'd get a triple headache).
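The function itself isn't posted here, but a strpos()/substr() version of the same idea might look roughly like this; the marker strings in the example are just for illustration.

<?php
// A guess at a "grab whatever sits between two markers" helper,
// built on strpos()/substr() instead of regex.
function grab_between($haystack, $begin, $end)
{
    $start = strpos($haystack, $begin);
    if ($start === false) {
        return false;                 // begin marker not found
    }
    $start += strlen($begin);         // skip past the begin marker itself

    $stop = strpos($haystack, $end, $start);
    if ($stop === false) {
        return false;                 // end marker not found
    }
    return substr($haystack, $start, $stop - $start);
}

// Example: pull a price out of a known chunk of markup.
$page = '<span class="price">$19.95</span>';
echo grab_between($page, '<span class="price">', '</span>');  // $19.95
?>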
Several things come to mind, and I'm far from a cURL expert. Only the first has any bearing here.

1. fopen() is disabled on many servers.
2. cURL lets you spoof the useragent (ever want to surf as Googlebot?).
3. cURL lets you simulate a $_POSTed form.
4. cURL will read and follow redirects (301, etc.).

I barely skim the surface of what's possible with cURL, but I'm sure these guys will give you a better idea of what can happen. I also hear it's faster than fopen() or file_get_contents(), though I have never benchmarked it. Anyone else have some experience with it they'd like to share? A rough sketch covering points 2 through 4 follows.
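Something along these lines covers points 2 through 4; the URL and form fields are placeholders.

<?php
// Sketch covering points 2-4: spoofed useragent, a simulated POST,
// and automatic redirect following. URL and form fields are placeholders.
$ch = curl_init('http://example.com/search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string
curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
curl_setopt($ch, CURLOPT_POST, true);             // simulate a $_POSTed form
curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=widgets&page=1');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow 301/302 redirects

$page = curl_exec($ch);
curl_close($ch);
?>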
Thanks for the reply. A few years ago I wrote a scraper in ASP, and writing the regex or pattern matches can be a real pain. I seem to remember removing carriage returns and line feeds before trying to parse anything. The code above may come in handy for me. I have a large classic ASP site that I need to convert to a CMS, and I hate the thought of doing it by hand. At least I have a sitemap for the site, so I really don't need to crawl.
You're on the right track. Your best bet would be to start with the following:

1. $page = file_get_contents('http://sourcesite.com'); // get the content of the source page into a variable
2. $page = preg_replace('/\s+/', ' ', $page); // collapse every run of whitespace into a single space " " so you don't kill yourself finding tabs and returns

The rest is unique work for each website. Good luck with it; I was on the same job a couple of weeks ago, so I know. But once you get it done, you feel proud.
If you are on PHP, and especially with PHP 5, you could use the DOM functions to get the structure of the page like an XML file and navigate it. This is much easier than regexps...

<?php
$html =<<<EOT
<ul>
  <ul id="section1">
    <li name="param1">value1</li>
    <li name="param2">value2</li>
  </ul>
  <ul id="section2">
    <li name="param3">value3</li>
  </ul>
</ul>
EOT;

$dom = new DomDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($html);

$aryLi = $dom->getElementsByTagName('li');
foreach ($aryLi as $li) {
    echo $li->getAttribute('name') . '<br>';
}
?>

Expected result:
--------------
param1
param2
param3
Tripy, great post, I'll give this a try next time I need to scrape. It seems like a great solution when a document is well formatted. I'm afraid you would still need a regex (or explode() in this example) for most scraping needs, as, for example, most addresses are laid out like:

<address>Company Name<br>
123 Mystreet Ave.<br>
Oakland, CA 12345</address>

A question for you (so I don't have to dig around php.net for the answer ... lazy me). Using your above example, is there a simple way to use the DOM functions to get an expected result of:

value1
value2
value3
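For that address layout, a quick explode() sketch (just one way to do it) might be:

<?php
// Quick sketch of the explode() approach for the address layout above.
$html = '<address>Company Name<br>123 Mystreet Ave.<br>Oakland, CA 12345</address>';

$inner = strip_tags($html, '<br>');                  // keep only the <br> separators
$lines = array_map('trim', explode('<br>', $inner)); // split on them and tidy up

echo $lines[0];  // Company Name
echo $lines[1];  // 123 Mystreet Ave.
echo $lines[2];  // Oakland, CA 12345
?>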
As a side note, for your not-well-formatted content there is a function normalizeDocument() you could test. Don't know how it would react, though...
php.net/manual/en/function.dom-domdocument-normalizedocument.php

And as for the element values:

echo $li->nodeValue;
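Plugged back into the loop from the earlier post, that would be roughly:

<?php
// Same loop as above, echoing the element text instead of the attribute.
foreach ($aryLi as $li) {
    echo $li->nodeValue . '<br>';   // value1, value2, value3
}
?>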