App to check for updates in a site?

Discussion in 'Programming' started by tbarr60, Aug 8, 2007.

  1. tbarr60
    I'd like to have a script to check a site (thousands of pages) for updates. I'd need the ability to spider the site, spider it again some time later, and have it produce a summary of the differences. Are there apps out there already, or should I start coding away in one of my favorite languages?
     
    tbarr60, Aug 8, 2007 IP
  2. ecentricNick (Peon)
    If you're running on Linux, wget will spider the site and cache the pages.

    Furthermore, I believe you can pass optional parameters so that it only re-downloads a page if it has changed.

    A simple look at the timestamps against each cached file will then tell you what's changed.
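
    Something along these lines, as a rough sketch (the URL is a placeholder; check man wget for the exact flags on your version):

        # Mirror the site; --timestamping (-N) skips files the server reports as unchanged
        wget --recursive --timestamping --no-parent --wait=1 http://example.com/

        # Cached files modified in the last day are the pages that changed
        find example.com/ -type f -mtime -1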
     
    ecentricNick, Aug 9, 2007 IP
  3. PowerExtreme (Banned)
    This can be done by checking the size of the file. I don't know if there is any such app yet, though.
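
    If you wanted to try the size idea without downloading every page, a HEAD request can give you the size up front. A quick sketch (this assumes the server actually sends a Content-Length header, which plenty of dynamic pages don't):

        # Fetch only the response headers and pull out the reported size
        curl -sI http://example.com/page.html | grep -i '^content-length'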
     
    PowerExtreme, Aug 9, 2007 IP
  4. tbarr60 (Notable Member)
    As a point of clarification, the site I was considering doing this on is not my site; it's just a site I'd like to monitor for updates.
     
    tbarr60, Aug 9, 2007 IP
  5. ecentricNick (Peon)
    wget will spider (and cache) any site on the net, storing the results on your server. So if it were me, I'd still use the method posted above.
     
    ecentricNick, Aug 10, 2007 IP
  6. henryb (Member)
    wget works on Windows as well, so you can download the pages from time to time into different folders and then compare them with WinDiff, for example (http://en.wikipedia.org/wiki/WinDiff). I don't know whether a raw HTML diff makes sense for you; I imagine you would want to parse some info out.
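
    As a rough sketch of the dated-folders idea (the URL and snapshot dates are placeholders; it's shown with plain diff here, but on Windows you'd point WinDiff at the two folders instead):

        # Snapshot the site into a folder named for today's date
        wget --recursive --no-parent --directory-prefix="snapshots/$(date +%F)" http://example.com/

        # Later, compare any two snapshots recursively
        diff -r snapshots/2007-08-10 snapshots/2007-08-17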
     
    henryb, Aug 10, 2007 IP
  7. henryb (Member)
    Alternatively, you could ask somebody to write a custom spider that caches the pages, creates checksums, and then downloads/analyzes only the pages that have changed.
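
    The checksum approach might look something like this sketch over a local cache (paths and file names are placeholders):

        # Record a checksum for every cached page
        find example.com/ -type f -name '*.html' -exec md5sum {} + > checksums.new

        # Lines that differ from the last run are the changed pages
        diff checksums.old checksums.new
        mv checksums.new checksums.old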
     
    henryb, Aug 10, 2007 IP
  8. ecentricNick (Peon)
    The real problem is going to be deciding whether anything really changed.

    For example, say they include the current date on the page: that's going to change every day. Or they might use a server-side rotator script to insert banner images, so every time you fetch the page it looks changed.

    So, it's probably going to get a lot more complex unless you can easily identify simple things to look for within the pages.
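
    One way around that is to strip the volatile bits before comparing. A sketch, assuming the date and banner markup can be matched with simple patterns (the patterns here are invented; you'd tailor them to the actual site):

        # Drop lines that always change, then compare what's left
        strip_noise() {
            sed -e '/class="current-date"/d' -e '/banner_rotator/d' "$1"
        }
        diff <(strip_noise old/page.html) <(strip_noise new/page.html)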
     
    ecentricNick, Aug 10, 2007 IP
  9. tbarr60 (Notable Member)
    To accomplish what I want to do, a list of pages whose links have changed in a certain place in the body would be sufficient. If the links in the main part of the page change, I want to know about it.
     
    tbarr60, Aug 10, 2007 IP
  10. henryb (Member)
    I agree. You can exclude images and compare only the text of the pages.
    I assumed he is talking about a static website that rarely changes. If it is dynamic and inserts a new date into every page, that would be a problem.
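
    One easy way to compare only the text is to diff a text-mode rendering of each page, e.g. with lynx, which throws away the markup and images (a sketch; the paths are placeholders):

        diff <(lynx -dump old/page.html) <(lynx -dump new/page.html)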
     
    henryb, Aug 10, 2007 IP
  11. ecentricNick (Peon)
    In that case, all you need is a cron job running a simple shell script that calls wget on each page, pipes the output through a regular expression to extract the links, and writes the result to a file. That would do what you want very efficiently.

    If you're more comfortable with PHP than with Unix OS calls, you can do it equally well with fopen or cURL, again parsing the result with a regular expression to extract the links. Hook it all up to a MySQL database and you can easily compare to see whether any links have changed (or check that your link is still there, which is probably all you really want to do).
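
    A sketch of the shell variant (the URL, paths and schedule are placeholders, and the href regex is deliberately crude):

        #!/bin/sh
        # check_links.sh - fetch a page, extract its links, diff against the last run
        URL="http://example.com/somepage.html"
        NEW=/tmp/links.new
        OLD=/var/cache/links.old

        # Fetch the page to stdout and pull out every href attribute, sorted
        wget -q -O - "$URL" | grep -o 'href="[^"]*"' | sort > "$NEW"

        # Report only when the link list differs from the previous run
        if ! diff -q "$OLD" "$NEW" > /dev/null 2>&1; then
            echo "Links changed on $URL"
            diff "$OLD" "$NEW"
        fi
        mv "$NEW" "$OLD"

    Then a crontab entry like "0 6 * * * /path/to/check_links.sh" runs it once a day at 6am.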
     
    ecentricNick, Aug 10, 2007 IP