Page scraper

Discussion in 'Programming' started by jaymcc, Feb 2, 2006.

  1. #1
    Hi

    I'm looking for a piece of software that will do some HTML page scraping for me.

    I want to be able to tell this piece of software to go to site yada.com, log in, grab some piece of information (like the contents of a table cell) and then store it somewhere for use later.

    I need to be able to schedule this as well.

    Any ideas would be great, thanks

    J
     
    jaymcc, Feb 2, 2006 IP
  2. l234244

    l234244 Peon

    Messages:
    1,225
    Likes Received:
    50
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Expect to get a lot of lawsuits
     
    l234244, Feb 2, 2006 IP
  3. jaymcc

    jaymcc Peon

    Messages:
    139
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Thanks for the warnings. I want to build something that will, every hour, go and get the following:
    - Adsense revenue
    - CJ Revenue
    - Stats on my site - visits and views and comments
    - Various affiliate revenues from 3 different sites
    - GB traffic usage from my webhosts
    - Anything else I use as a measure of success

    Seems like every morning I wake up and flick through RoboForm logins for about 10 different places to find out what's going on. I only need one piece of information from most places.

    Legit intentions all round.

    If I can do this it will be something I will pass on to others. I'd suggest most of us here are in the same boat: this takes up too much time, and there must be a better way.

    Jay
     
    jaymcc, Feb 2, 2006 IP
  4. ing

    ing Well-Known Member

    Messages:
    500
    Likes Received:
    38
    Best Answers:
    0
    Trophy Points:
    195
    #4
    I haven't heard of any software that does what you are looking for.

    I used to go to each site I wanted to check my stats for separately, and it took a while. Now I have saved the login pages of all the sites I check stats for as bookmarks in Firefox. They are all in one folder called "stats", and I just go to that folder and open up all the pages at once with the "open in tabs" option. FF saves all my passwords, so it's just a matter of clicking login at each site. Saves a bit of time.

    It would be a pain having to do that every hour. I hope you find something.

    :)
     
    ing, Feb 2, 2006 IP
  5. neterslandreau

    neterslandreau Peon

    Messages:
    279
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Use the cURL functions to grab the necessary pages and then write a parser to extract the info you need.
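    A minimal sketch of that approach in Perl, via the WWW::Curl binding to libcurl (the URL and the regex are placeholders, not anything from a real site):

    #!/usr/bin/perl -w
    use strict;
    use WWW::Curl::Easy;

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_URL, 'http://yada.com/stats.html');  # placeholder URL
    $curl->setopt(CURLOPT_FOLLOWLOCATION, 1);

    # Collect the response body into a scalar
    my $body;
    open(my $fh, '>', \$body);
    $curl->setopt(CURLOPT_WRITEDATA, $fh);

    my $retcode = $curl->perform;
    die "curl error: " . $curl->strerror($retcode) if $retcode != 0;

    # Crude parser: pull the first table cell (adjust the pattern to the real page)
    if ($body =~ m{<td[^>]*>(.*?)</td>}si) {
        print "Got: $1\n";
    }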
     
    neterslandreau, Feb 2, 2006 IP
  6. pwaring

    pwaring Well-Known Member

    Messages:
    846
    Likes Received:
    25
    Best Answers:
    0
    Trophy Points:
    135
    #6
    You could probably do this with libwww in Perl and some judicious use of regular expressions.
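    A minimal sketch of that, assuming a stats page with a "Visits" table cell (the URL and the pattern are made up):

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    # Placeholder URL; pages behind a login need the cookie handling shown later in this thread
    my $html = get('http://yada.com/stats.html')
        or die "Couldn't fetch page";

    # Judicious regular expression: grab the cell that follows a "Visits" label
    if ($html =~ m{Visits</td>\s*<td[^>]*>([\d,]+)</td>}si) {
        print "Visits: $1\n";
    }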
     
    pwaring, Feb 2, 2006 IP
  7. Edynas

    Edynas Peon

    Messages:
    796
    Likes Received:
    24
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Edynas, Feb 3, 2006 IP
  8. jaymcc

    jaymcc Peon

    Messages:
    139
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Hadn't considered this, thanks for the warning.
     
    jaymcc, Feb 3, 2006 IP
  9. aboyd

    aboyd Well-Known Member

    Messages:
    158
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    138
    #9
    Yeah, I use Perl with the LWP & Cookies modules to do this stuff. All my screen-scraping scripts start the same way:

    #!/usr/bin/perl -w

    use strict;
    use HTTP::Cookies;
    use LWP;

    # Set up a browser object with sane limits
    my $browser = LWP::UserAgent->new;
    $browser->timeout(20);      # give up on a request after 20 seconds
    $browser->max_size(20480);  # never pull down more than 20 KB per page
    $browser->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)");  # look like IE

    # A cookie jar so logins persist across requests
    my $cookie_jar = HTTP::Cookies->new();
    $browser->cookie_jar($cookie_jar);
    At that point, I start slinging around the $browser->post() and $browser->get() functions. That part varies, of course. Find good docs on LWP and you should be able to get up to speed easily.
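    For instance, a login followed by a fetch might look like this (the URLs and form field names are made up; check the target site's login form for the real ones):

    # Log in; the posted form sets session cookies in the cookie jar
    my $response = $browser->post(
        'http://yada.com/login',                     # hypothetical login URL
        [ username => 'me', password => 'secret' ],  # hypothetical field names
    );
    die "Login failed" unless $response->is_success or $response->is_redirect;

    # Now fetch the protected stats page with the session cookies attached
    $response = $browser->get('http://yada.com/stats');
    my $html = $response->content;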

    -Tony
     
    aboyd, Feb 4, 2006 IP
  10. jaymcc

    jaymcc Peon

    Messages:
    139
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Thanks all for your comments. I found an application that does a great job of this called screen-scraper. http://www.screen-scraper.com

    They have a free version that can put together a script runnable from a command prompt. They also have a pro version for integrating into an app, but at US$499 it's very expensive.

    I think I will persist with the free one and get it to write out to an XML file, then build a dashboard to display the XML info.
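    (For anyone trying the same thing later: reading such a file back in Perl is only a few lines with XML::Simple. The file name and structure here are invented, just to show the shape of it.)

    #!/usr/bin/perl -w
    use strict;
    use XML::Simple;

    # Hypothetical stats.xml along the lines of:
    #   <stats><adsense>12.34</adsense><visits>5678</visits></stats>
    my $stats = XMLin('stats.xml');
    print "AdSense: $stats->{adsense}\n";
    print "Visits:  $stats->{visits}\n";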

    aboyd, does the strategy above get around the login process?

    Thanks again

    J
     
    jaymcc, Feb 4, 2006 IP
  11. aboyd

    aboyd Well-Known Member

    Messages:
    158
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    138
    #11
    That's what the Cookies module is used for. It allows you to log in and save the session.

    For example, I have some phpBB forums. I want to post in the forums once or twice a day, just some thought-provoking news item or whatever. However, I don't want to go to the site each day and post manually. So, each weekend I read a TON of Google News, find 20 relevant stories, and then queue up 20 posts in a database. Each post has a date to go with it. Finally, I have my Perl app check the database hourly. If it sees that a post's time has come, the Perl app logs into the forum as me and sends the post.

    In this way, I can be active on the site for a whole week with only a few hours on a weekend. The Perl code does the login and posting for me; the skeleton below shows the general shape of it.
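    (A skeleton of that hourly check, with invented table and column names; the actual login and $browser->post() call work as in post #9.)

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    # Hypothetical queue table: posts(id, subject, body, post_at, sent)
    my $dbh = DBI->connect('DBI:mysql:database=queue', 'user', 'pass',
                           { RaiseError => 1 });

    my $due = $dbh->selectall_arrayref(
        'SELECT id, subject, body FROM posts
         WHERE post_at <= NOW() AND sent = 0');

    for my $row (@$due) {
        my ($id, $subject, $body) = @$row;
        # ... log into the forum and $browser->post() the new topic here ...
        $dbh->do('UPDATE posts SET sent = 1 WHERE id = ?', undef, $id);
    }

    A crontab entry like "0 * * * * /home/me/queue_poster.pl" then runs it at the top of every hour.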

    -Tony
     
    aboyd, Feb 4, 2006 IP
  12. xern

    xern Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Maybe you can consider using FEAR::API, another site-scraping framework based on Perl. The module itself is free of charge, and you can use it to create your own specific scraping scripts.

    See search.cpan.org/perldoc?FEAR::API

    Best,
    Yung-chung Lin
     
    xern, May 16, 2006 IP
  13. saneinsight

    saneinsight Guest

    Messages:
    159
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #13
    This thread reminds me of a great quote from Larry Wall, creator of the Perl language.

    "A truly great computer programmer is lazy, impatient and full of hubris"
     
    saneinsight, Nov 27, 2006 IP