Hello, I am trying to extract data from a specific web page. For example, given this page, find the current price of the product: http://www.ocado.com/webshop/product/Ocado-Smoked-Salmon-Long-Sliced/59534011? I am trying to find the best way to do this. The options I know of are:

1) The iMacros extension for Firefox, which allows a script to be run inside Firefox and does not require a server.
2) Perl. I don't know exactly how to do it, but I think it requires a server.
3) PHP. I don't know exactly how to do it, but I think it requires a server.

I would be glad if someone could comment on these methods, mention anything easier, and guide me in the right direction. Many thanks.
It depends on which language you know best. For me, the best way is to use a regular expression (preg_match) in PHP to extract the content.
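A minimal sketch of that approach. The HTML fragment and the class name `nowPrice` are placeholders taken from the product page in the question; in practice you would fetch the real page first (e.g. with file_get_contents or cURL).

```php
<?php
// Hypothetical product-page fragment; normally this would come from
// file_get_contents($url) or a cURL request.
$html = '<span class="nowPrice">3.49</span>';

// Capture the text inside the span with a regular expression.
// The 's' modifier lets '.' match across newlines in real pages.
if (preg_match('%<span class="nowPrice">(.*?)</span>%s', $html, $m)) {
    echo $m[1], "\n"; // prints 3.49
}
?>
```

Note that regexes are brittle against markup changes; if the site reorders attributes or adds whitespace, the pattern needs adjusting, which is why a DOM parser is often the safer choice.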
I prefer Perl (using WWW::Mechanize and an HTML extractor library) to crawl and scrape. With PHP, you can use cURL to crawl and the DOM to scrape.
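A sketch of the PHP side of this answer: cURL for the crawl step and DOMDocument/DOMXPath for the scrape step. The function names and the `nowPrice` class are illustrative; only the demo on an inline snippet is run here, since the fetch depends on the network.

```php
<?php
// Crawl step: fetch a page with cURL (the Perl analogue would be
// WWW::Mechanize's get()).
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Scrape step: return the text of the first element carrying a given class.
function extract_by_class($html, $class) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // '@' suppresses warnings from messy real-world markup
    $xpath = new DOMXPath($dom);
    // Match the class token even when the attribute holds several classes.
    $nodes = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]"
    );
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}

// Demo on an inline fragment (class name matches the Ocado page markup
// quoted in another answer):
$sample = '<p><span class="nowPrice">3.49</span></p>';
echo extract_by_class($sample, 'nowPrice'), "\n"; // prints 3.49
?>
```

The DOM/XPath route survives attribute reordering and whitespace changes that would break a regex-based extractor.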
You can use any programming language. PHP and Perl do not require a server; you can run them locally. It is a good idea to use a library to parse the content, for example http://simplehtmldom.sourceforge.net/. With PHP and Simple HTML DOM, a solution to your problem would look like this:

```php
<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://www.ocado.com/webshop/product/Ocado-Smoked-Salmon-Long-Sliced/59534011?');

$was_price = html_entity_decode(trim($html->find('span.wasPrice', 0)->plaintext), ENT_NOQUOTES, 'UTF-8');
$price     = html_entity_decode(trim($html->find('span.nowPrice', 0)->plaintext), ENT_NOQUOTES, 'UTF-8');

print "Old price $was_price\n";
print "New price $price\n";
?>
```
I use PHP's file_get_contents and then preg_match_all to extract data from certain parts of the page:

```php
$html = file_get_contents("http://example.com");
preg_match_all('%<span class="item">(.*?)</span>%sim', $html, $result);
```