Hi, I'm looking for a piece of software that will do some HTML page scraping for me. I want to be able to tell it to go to site yada.com, log in, grab a piece of information (like the contents of a table cell) and then store it somewhere for later use. I also need to be able to schedule this. Any ideas would be great, thanks. J
Thanks for the warnings. I want to build something that will, every hour, go and get the following:
- AdSense revenue
- CJ revenue
- Stats on my site - visits, views and comments
- Various affiliate revenues from 3 different sites
- GB traffic usage from my web hosts
- Anything else I use as a measure of success
Seems like every morning I wake up and flick through RoboForm logins for about 10 different places to find out what's going on, and I only need one piece of information from most of them. Legit intentions all round. If I can do this, it will be something I will pass on to others; I would suggest most of us here are in the same boat. This takes up too much time, there must be a better way. Jay
I haven't heard of any software to do what you are looking for. I used to go to each site I wanted to check my stats for separately, and it did take a while. Now I have saved the login pages of all the sites I like to check stats for as bookmarks in Firefox. They are all in one folder called "stats", and I just go to that folder and open up all the pages at once with the "open in tabs" option. FF saves all my passwords, so it's just a matter of clicking login at each site. Saves a bit of time, but it would be a pain having to do that every hour. I hope you find something.
Use the cURL functions to grab the necessary pages and then write a parser to extract the info you need.
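Something like this, for instance, in Perl (WWW::Curl::Easy for the fetch, HTML::TableExtract for the parse; the URL and the "Revenue" column header are just placeholders for whatever the site actually uses):

#!/usr/bin/perl -w
use strict;
use WWW::Curl::Easy;
use HTML::TableExtract;

# Fetch the page body into a scalar
my $curl = WWW::Curl::Easy->new;
my $html = '';
$curl->setopt(CURLOPT_URL, 'http://www.example.com/stats');   # placeholder URL
$curl->setopt(CURLOPT_HEADER, 0);
$curl->setopt(CURLOPT_WRITEDATA, \$html);
my $retcode = $curl->perform;
die 'Fetch failed: ' . $curl->strerror($retcode) . "\n" if $retcode != 0;

# Pull the cells under a column headed "Revenue" (placeholder header)
my $te = HTML::TableExtract->new( headers => ['Revenue'] );
$te->parse($html);
foreach my $ts ($te->tables) {
    foreach my $row ($ts->rows) {
        print "Found value: $row->[0]\n";
    }
}

From there you can write the value to a file or database and have cron run the script on whatever schedule you like.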
Take a look at some of the web-fetching scripts at Hotscripts: http://hotscripts.com/PHP/Scripts_and_Programs/Web_Fetching/index.html - but do keep in mind that some sites don't like bots getting in, and I dunno if it's worth losing your account over.
Yeah, I use Perl with the LWP & HTTP::Cookies modules to do this stuff. All my screen-scraping scripts start the same way:

#!/usr/bin/perl -w
use strict;
use HTTP::Cookies;
use LWP;

# Browser object with sane limits and an IE-looking user agent
my $browser = LWP::UserAgent->new;
$browser->timeout(20);
$browser->max_size(20480);
$browser->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)");

# Cookie jar so logins persist across requests
my $cookie_jar = HTTP::Cookies->new();
$browser->cookie_jar($cookie_jar);

At that point, I start slinging around the $browser->post() and $browser->get() functions. That part varies, of course. Find good docs on LWP and you should be able to get up to speed easily.
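For instance, picking up from that $browser (the login URL and the form field names here are made up - lift the real ones from the target site's login form):

# Log in by POSTing the login form; the cookie jar keeps the session
my $login = $browser->post(
    'http://www.example.com/login',
    [ username => 'jay', password => 'secret' ],
);
die 'Login failed: ' . $login->status_line . "\n" unless $login->is_success;

# Later GETs ride on the saved cookies, so they come back as the logged-in user
my $page = $browser->get('http://www.example.com/stats');
die 'Fetch failed: ' . $page->status_line . "\n" unless $page->is_success;

# Crude extraction - a regex (or a proper parser) pulls out the one number you want
if ($page->content =~ /Revenue:\s*\$([\d.]+)/) {
    print "Revenue today: \$$1\n";
}

-Tony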
Thanks all for your comments. I found an application that does a great job of this, called screen-scraper: http://www.screen-scraper.com They have a free version that can put together a script to run from a command prompt; there is a pro version as well, for integrating into an app, however it's US$499, which is very expensive. I think I will persist with the free one, get it to write out to an XML file, and then build a dashboard to display the XML info. aboyd, does the strategy above get around the login process?
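For the dashboard side, this is roughly the XML write-out I have in mind if I end up scripting it myself (XML::Simple, with made-up stat names):

use strict;
use XML::Simple;

# These values would come from whatever does the scraping; the names are placeholders
my %stats = (
    adsense => '12.34',
    cj      => '5.67',
    visits  => 890,
);

open my $fh, '>', 'dashboard.xml' or die "Can't write dashboard.xml: $!";
print $fh XMLout(\%stats, RootName => 'stats', NoAttr => 1);
close $fh;

Thanks again, J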
That's what the Cookies module is for. It allows you to log in & save the login. For example, I have some phpBB forums. I want to post in the forums 1 or 2 times a day, just some provoking news item or whatever, but I don't want to go to the site each day and post manually. So, each weekend I read a TON of Google News, find 20 relevant stories, and then queue up 20 posts in a database. Each post has a date to go with it. Finally, I have my Perl app check the database hourly. If it sees that a post's time has come, the Perl app logs into the forum as me and sends the post. In this way, I can be active on the site for a whole week with only a few hours spent on the weekend. The Perl code does the login & posting for me.
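Stripped down, the hourly job looks something like this (the table layout, URLs and phpBB form fields are specific to my setup, so treat them as placeholders):

use DBI;

# $browser is the cookie-aware LWP::UserAgent from my earlier post, already logged in
my $dbh = DBI->connect('DBI:mysql:database=queue', 'dbuser', 'dbpass')
    or die DBI->errstr;

# Grab any queued posts whose time has come
my $due = $dbh->selectall_arrayref(
    'SELECT id, subject, body FROM posts WHERE post_at <= NOW() AND sent = 0'
);

foreach my $row (@$due) {
    my ($id, $subject, $body) = @$row;

    # Submit the board's "new topic" form (field names vary by phpBB version)
    my $response = $browser->post(
        'http://www.example.com/forum/posting.php',
        [ mode => 'newtopic', f => 2, subject => $subject, message => $body, post => 'Submit' ],
    );

    # Mark it as sent so it doesn't go out again next hour
    $dbh->do('UPDATE posts SET sent = 1 WHERE id = ?', undef, $id)
        if $response->is_success;
}

Cron kicks the script off once an hour; the database does the rest of the scheduling. -Tony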
Maybe you can consider using FEAR::API, another Perl-based site-scraping framework. The module itself is free of charge, and you can use it to create your own specific scraping scripts. See search.cpan.org/perldoc?FEAR::API Best, Yung-chung Lin
This thread reminds me of a great quote from Larry Wall, creator of the Perl language: "A truly great computer programmer is lazy, impatient and full of hubris."