I'd like to have a script to check a site (thousands of pages) for updates. So I'd need the ability to spider a site, then spider the same site again some time later, and have it produce a summary of the differences. Are there apps out there already, or should I start coding away in one of my favorite languages?
If running on Linux, wget will spider the site and cache the pages. Furthermore, you can pass it the timestamping option (-N) so that it only re-downloads a page if it has changed on the server. A quick look at the timestamps on the cached files will then tell you what's changed.
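Something along these lines, as a rough sketch (example.com, the spool directory and the polite one-second wait are just placeholders, and it assumes the server sends sensible Last-Modified headers):

#!/bin/sh
# Crawl the site, re-fetching only pages that changed on the server.
cd /var/spool/sitewatch || exit 1

# Remember when the previous crawl ran, then stamp this one
# (cp -p keeps the old timestamp intact).
[ -f last-run ] && cp -p last-run prev-run
touch last-run

# --mirror = recursive + timestamping (-r -N -l inf), so unchanged pages
# are skipped and re-fetched pages get the server's modification time.
wget --mirror --no-parent --wait=1 -P mirror http://www.example.com/

# Any cached file newer than the previous crawl is a changed (or new) page.
[ -f prev-run ] && find mirror -type f -newer prev-run -print > changed.txt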
As a point of clarification, the site I was considering doing this on is not my site, it's just a site I'd like to monitor for updates.
wget will spider (and cache) any site on the net, storing the results on your server, so I still reckon if it were me, I'd use the method posted above.
wget works on Windows as well, so you can download the pages from time to time into different folders and then compare them with WinDiff, for example (http://en.wikipedia.org/wiki/WinDiff). I don't know whether a diff of the raw HTML is meaningful for you; I imagine you would want to parse some information out.
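Roughly like this (the URL and folder names are just examples; on Windows you would point WinDiff at the two folders instead of running diff):

#!/bin/sh
# Take a dated snapshot of the site and diff it against the previous one.
today=$(date +%Y-%m-%d)
mkdir -p reports
wget --recursive --no-parent --wait=1 -P "snapshots/$today" http://www.example.com/

# Pick the snapshot taken before today's (on the very first run this is
# today's own folder, so the report is simply empty).
prev=$(ls -d snapshots/*/ | sort | tail -n 2 | head -n 1)
diff -ru "$prev" "snapshots/$today" > "reports/$today.diff"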
Alternatively, you can ask somebody to write a custom spider that caches the pages, creates checksums, and then downloads/analyzes only the pages that have changed.
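The checksum part doesn't need a custom spider, though; over a cached copy it can be as simple as this sketch (the mirror/ directory and file layout are assumptions):

#!/bin/sh
# Checksum every cached page and report the ones whose content changed
# since the last run (this also catches added or removed pages).
find mirror -type f -name '*.html' -exec md5sum {} + | sort -k 2 > checksums.new

if [ -f checksums.old ]; then
    diff checksums.old checksums.new | grep '^[<>]'
fi
mv checksums.new checksums.old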
The real problem is going to be deciding whether anything really changed. For example, let's say they include the current date on the page. Well, that's going to change every day. Or if they use a server side rotator script to insert banner images. Every time you look at the page it's changed. So, it's probably going to get a lot more complex unless you can easily identify simple things to look for within the pages.
To accomplish what I want to do, a list of pages with changed links in a certain place in the body of the page would be sufficient. If the links changed in the main part of the page I would want to know.
I agree. You can exclude images and compare only the text of the pages. I assumed he is talking about a static website that rarely changes. If it is dynamic and inserts a new date into every page, that would be a problem.
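For that, comparing a text dump of each page instead of the raw HTML already filters out a lot of the noise; a rough sketch, assuming lynx is installed and you have an old and a new copy of the page saved locally (file names are placeholders):

#!/bin/sh
# Diff only the rendered text of two saved copies of a page, so markup
# tweaks and swapped image tags don't count as updates. The reference
# list lynx appends also exposes the link URLs, which is what matters here.
lynx -dump old/page.html > old.txt
lynx -dump new/page.html > new.txt
diff -u old.txt new.txt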
In that case, all you need is a cron job running a simple shell script that calls wget on each page, pipes the output through a regular expression to extract the links, and writes the result to a file; that would do what you want very efficiently. If you're more comfortable with php than with unix o/s calls, you can do it equally well with fopen or curl, again parsing the result with a regular expression to extract the links. Hook it all up to a mysql database and you can easily compare runs to see if any links have changed (or check your link is still there, which is probably all you really want to do).
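A rough sketch of the shell version (urls.txt, the links/ directory and the crude href regex are all assumptions you'd tune for the actual site):

#!/bin/sh
# For each URL in urls.txt, fetch the page, extract its links and compare
# them against the set saved on the previous run.
mkdir -p links
while read -r url; do
    # Filesystem-safe name for this URL.
    name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')

    # Fetch the page and pull out every href value.
    wget -q -O - "$url" | grep -oE 'href="[^"]*"' | sort -u > "links/$name.new"

    if [ -f "links/$name.old" ] && ! diff -q "links/$name.old" "links/$name.new" >/dev/null; then
        echo "Links changed on: $url"
    fi
    mv "links/$name.new" "links/$name.old"
done < urls.txt

Drop that into cron (e.g. a crontab line like 0 3 * * * /home/you/check-links.sh) and swap the flat files for a mysql table if you want a history of the changes.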