I'd like to have a script to check a site (thousands of pages) for updates. So I'd need the ability to spider a site, then spider the same site again some time later, and have it produce a summary of the differences. Are there apps out there already, or should I start coding away in one of my favorite languages?
If running on Linux, wget will spider the site and cache the pages. Furthermore, you can pass it the timestamping option (-N) so that it only re-downloads a page if it has changed on the server. A quick look at the timestamps on the cached files will then tell you what's changed.
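Something along these lines, as a rough sketch (example.com, the spool directory and the polite one-second wait are just placeholders, and it assumes the server sends sensible Last-Modified headers):

#!/bin/sh
# Crawl the site, re-fetching only pages that changed on the server.
cd /var/spool/sitewatch || exit 1

# Remember when the previous crawl ran, then stamp this one
# (cp -p keeps the old timestamp intact).
[ -f last-run ] && cp -p last-run prev-run
touch last-run

# --mirror = recursive + timestamping (-r -N -l inf), so unchanged pages
# are skipped and re-fetched pages get the server's modification time.
wget --mirror --no-parent --wait=1 -P mirror http://www.example.com/

# Any cached file newer than the previous crawl is a changed (or new) page.
[ -f prev-run ] && find mirror -type f -newer prev-run -print > changed.txt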
As a point of clarification, the site I was considering doing this on is not my site, it's just a site I'd like to monitor for updates.
wget will spider (and cache) any site on the net, storing the results on your server, so I still reckon if it were me, I'd use the method posted above.
wget works on Windows as well, so you can download the pages from time to time into different folders and then compare them with WinDiff, for example (http://en.wikipedia.org/wiki/WinDiff). I don't know whether a diff of the raw HTML is meaningful for you; I imagine you would want to parse some information out.
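Roughly like this (the URL and folder names are just examples; on Windows you would point WinDiff at the two folders instead of running diff):

#!/bin/sh
# Take a dated snapshot of the site and diff it against the previous one.
today=$(date +%Y-%m-%d)
mkdir -p reports
wget --recursive --no-parent --wait=1 -P "snapshots/$today" http://www.example.com/

# Pick the snapshot taken before today's (on the very first run this is
# today's own folder, so the report is simply empty).
prev=$(ls -d snapshots/*/ | sort | tail -n 2 | head -n 1)
diff -ru "$prev" "snapshots/$today" > "reports/$today.diff"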
Alternatively, you can ask somebody to write a custom spider that caches the pages, creates checksums, and then downloads/analyzes only the pages that have changed.
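The checksum part doesn't need a custom spider, though; over a cached copy it can be as simple as this sketch (the mirror/ directory and file layout are assumptions):

#!/bin/sh
# Checksum every cached page and report the ones whose content changed
# since the last run (this also catches added or removed pages).
find mirror -type f -name '*.html' -exec md5sum {} + | sort -k 2 > checksums.new

if [ -f checksums.old ]; then
    diff checksums.old checksums.new | grep '^[<>]'
fi
mv checksums.new checksums.old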
The real problem is going to be deciding whether anything really changed. For example, let's say they include the current date on the page. Well, that's going to change every day. Or if they use a server side rotator script to insert banner images. Every time you look at the page it's changed. So, it's probably going to get a lot more complex unless you can easily identify simple things to look for within the pages.
To accomplish what I want to do, a list of pages with changed links in a certain place in the body of the page would be sufficient. If the links changed in the main part of the page I would want to know.
I agree. You can exclude images and compare only the text of the pages. I assumed he is talking about a static website that rarely changes. If it is dynamic and inserts a new date into every page, that would be a problem.
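For that, comparing a text dump of each page instead of the raw HTML already filters out a lot of the noise; a rough sketch, assuming lynx is installed and you have an old and a new copy of the page saved locally (file names are placeholders):

#!/bin/sh
# Diff only the rendered text of two saved copies of a page, so markup
# tweaks and swapped image tags don't count as updates. The reference
# list lynx appends also exposes the link URLs, which is what matters here.
lynx -dump old/page.html > old.txt
lynx -dump new/page.html > new.txt
diff -u old.txt new.txt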
In that case, all you need is a cron job running a simple shell script that calls wget on each page, pipes the output through a regular expression to extract the links, and writes the result to a file; that would do what you want very efficiently. If you're more comfortable with php than with unix o/s calls, you can do it equally well with fopen or curl, again parsing the result with a regular expression to extract the links. Hook it all up to a mysql database and you can easily compare runs to see if any links have changed (or check your link is still there, which is probably all you really want to do).
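A rough sketch of the shell version (urls.txt, the links/ directory and the crude href regex are all assumptions you'd tune for the actual site):

#!/bin/sh
# For each URL in urls.txt, fetch the page, extract its links and compare
# them against the set saved on the previous run.
mkdir -p links
while read -r url; do
    # Filesystem-safe name for this URL.
    name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')

    # Fetch the page and pull out every href value.
    wget -q -O - "$url" | grep -oE 'href="[^"]*"' | sort -u > "links/$name.new"

    if [ -f "links/$name.old" ] && ! diff -q "links/$name.old" "links/$name.new" >/dev/null; then
        echo "Links changed on: $url"
    fi
    mv "links/$name.new" "links/$name.old"
done < urls.txt

Drop that into cron (e.g. a crontab line like 0 3 * * * /home/you/check-links.sh) and swap the flat files for a mysql table if you want a history of the changes.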