Hi all, Here's the scenario: if you wanted to check whether a page is new (i.e. it has been updated since you last checked), which PHP functions would you look at? How could that be done using the absolute minimum bandwidth? I'm really looking to get an idea of what scaling issues would be involved if I wanted to check hundreds of thousands of URLs an hour. Sorry it's vague; feel free to be vague in return, I'm just looking for some starting points for research into a project idea. Thanks.
You could use one of the various methods to get the page (cURL, file_get_contents) and compare it to what you got the previous time you checked. Comparing the file size should be enough.
Sure, I know how to get the page; I like to use cURL. But as I said above, what is the best way to do this with minimal bandwidth usage? I really couldn't go and grab hundreds of thousands of pages an hour.
If the server is sending the Last-Modified header, you can use the cURL option to return the headers like so: curl_setopt($ch, CURLOPT_HEADER, TRUE); On its own, though, that returns the headers AND the content, which isn't needed. You may want to use cURL from a shell instead, for example: shell_exec('curl -I http://www.digitalpoint.com');
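For what it's worth, the same headers-only request can probably be done without leaving PHP by adding CURLOPT_NOBODY, which makes cURL send a HEAD request. A minimal sketch, assuming the server actually sends Last-Modified; the function names fetchHeadersOnly and parseLastModified are made up for illustration:

```php
<?php
// Hypothetical helper: fetch only the response headers for a URL.
// Returns the raw header block, or null if the request failed.
function fetchHeadersOnly(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, no body transferred
    curl_setopt($ch, CURLOPT_HEADER, true);         // include headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the output instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, tune to taste
    $raw = curl_exec($ch);
    return $raw === false ? null : $raw;
}

// Hypothetical helper: extract Last-Modified from a raw header block
// as a UNIX timestamp, or null if the header is missing or unparsable.
function parseLastModified(string $rawHeaders): ?int
{
    if (!preg_match('/^Last-Modified:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
        return null;
    }
    $timestamp = strtotime($m[1]);
    return $timestamp === false ? null : $timestamp;
}
```

You would then call parseLastModified(fetchHeadersOnly($url)) after a null check and compare the result with the timestamp stored from the previous run. Redirects are not followed in this sketch.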
If you are checking with a PHP script that resides on the same server as the files you are checking, you can do this: filemtime($filename); That will give you a UNIX timestamp of the last time the file was modified.
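A rough sketch of how that timestamp could be compared against the time of the last check. The function name fileChangedSince is hypothetical, and treating a missing file as "not changed" is just an assumption for this example:

```php
<?php
// Hypothetical helper: true if $filename was modified after $lastChecked
// (both are UNIX timestamps).
function fileChangedSince(string $filename, int $lastChecked): bool
{
    // PHP caches stat results within a request, so clear the cache first.
    clearstatcache(true, $filename);
    $modified = @filemtime($filename);
    if ($modified === false) {
        return false; // missing or unreadable: treated as "not changed" in this sketch
    }
    return $modified > $lastChecked;
}
```

Store time() after each run and pass it in as $lastChecked on the next one.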
Instead of GET, POST, etc., use the request method designed for what you want to do: HEAD. Read more: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.4 Excerpt: "This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification..." Regarding dated or altered content specifically, caches (intermediaries) also evaluate the response to a HEAD request. Excerpt: "If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale." In other words, the above are established standard practices for what you want to do. Cheers, JL
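To make the second excerpt concrete, here is a minimal sketch that compares the four fields it names between the headers saved from an earlier HEAD response and the headers from a new one. The helper names parseHeaders and looksChanged are invented for this example, and returning null when no field can be compared is an assumption about how you would want to handle that case:

```php
<?php
// Hypothetical helper: turn a raw header block into an array keyed by
// lowercase header name. The status line has no colon and is skipped.
function parseHeaders(string $rawHeaders): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n/', trim($rawHeaders)) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $headers;
}

// Hypothetical helper: true if any shared field differs, false if all
// shared fields match, null if there was nothing to compare.
function looksChanged(array $stored, array $current): ?bool
{
    $compared = false;
    foreach (['etag', 'last-modified', 'content-md5', 'content-length'] as $name) {
        if (isset($stored[$name], $current[$name])) {
            $compared = true;
            if ($stored[$name] !== $current[$name]) {
                return true;
            }
        }
    }
    return $compared ? false : null;
}
```

A null result means the server gave you nothing usable, so you would have to fall back to fetching the page itself.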
Personally I think you are stuck getting the text. Consider a dynamic website. The last-modified date will give you the age of the file which generates the page, not the age of the content (probably held in a database). Personally I can't see any way around it. Then you need to consider how to identify how much has changed.
* Is a small change acceptable?
* How many bytes have to change?
* Should a change that is only in a link be ignored, so that only content changes count?
Hope this helps - and good luck! Sarah
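One possible way to approach those questions once you have the page text, as a sketch only: hash the visible text so that markup-only changes (such as a different link URL) are ignored, and use a similarity threshold to decide whether a change is big enough to count. The function names and the 5% default are assumptions, not anything from this thread:

```php
<?php
// Hypothetical helper: reduce HTML to its visible text. Tags (including
// link hrefs) are dropped and runs of whitespace are collapsed.
function visibleText(string $html): string
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Hypothetical helper: a short fingerprint to store between checks.
function contentFingerprint(string $html): string
{
    return sha1(visibleText($html));
}

// Hypothetical helper: true if the two texts differ by at least
// $minPercent. similar_text() is slow on large inputs, so this is only
// worth running after the fingerprints are known to differ.
function changeIsSignificant(string $oldText, string $newText, float $minPercent = 5.0): bool
{
    if ($oldText === $newText) {
        return false;
    }
    similar_text($oldText, $newText, $similarity);
    return (100.0 - $similarity) >= $minPercent;
}
```

Storing only the fingerprint keeps the per-URL state small, but the threshold check needs the previous text as well, so that part costs more storage.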