Hi, I have about 25,000 external links in my community website's resource database. How easy is it to create a crawler that automatically detects dead links and reports them back to the admin for manual removal? Also, how can I create and display a cache of all these external links in case the external website is down, just like Google does? If you can refer me to some free/cheap software or literature, that would be great. Thanks in advance! Ins
Crawling the links is not rocket science; all you have to do is use code like the one below. If fopen() fails, it usually means the site is down; if you can open the URL, the site is live. Log the result to your database and report dead links to the admin for removal. Simple, isn't it? For the cache, you will have to grab the page contents and store them on your local server. Concept-wise things are easy, but the problems crop up when you start to go large-scale!
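Something along these lines is what I mean. This is only a minimal sketch: the isLinkAlive() helper and the $links array are placeholders for your own database rows, and fopen() on URLs requires allow_url_fopen to be enabled in php.ini.

<?php
// Minimal dead-link check: try to open the URL and report failures.
function isLinkAlive($url)
{
    $handle = @fopen($url, 'r');   // suppress warnings; we only care about success or failure
    if ($handle === false) {
        return false;              // could not open, treat the link as dead
    }
    fclose($handle);
    return true;
}

// Placeholder data; in practice you would SELECT id, url FROM your links table.
$links = array(
    1 => 'http://example.com/',
    2 => 'http://example.org/some-page',
);

foreach ($links as $id => $url) {
    if (!isLinkAlive($url)) {
        // Here you would flag the row in the database for the admin to review.
        echo "Link {$id} appears to be dead: {$url}\n";
    }
}
?>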
You can use a function like the one described here to download a remote file and save it for caching: http://damonparker.org/blog/2005/09/29/download-a-remote-file-using-php/
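The gist of caching is just "fetch the page and write it to disk, then serve the saved copy when the live site is down". A rough sketch of that idea (this is not the code from the linked article; cacheRemotePage() and the cache/ directory name are assumptions):

<?php
// Fetch a remote page and store it locally; the file name is derived from the URL hash.
function cacheRemotePage($url, $cacheDir = 'cache')
{
    $contents = @file_get_contents($url);   // needs allow_url_fopen, or swap in cURL
    if ($contents === false) {
        return false;                        // site unreachable; keep the old cached copy
    }

    $cacheFile = $cacheDir . '/' . md5($url) . '.html';
    file_put_contents($cacheFile, $contents);
    return $cacheFile;
}

// Usage: refresh the cached copy; serve cache/<md5>.html when the live site is down.
$file = cacheRemotePage('http://example.com/');
echo $file ? "Cached to {$file}\n" : "Could not fetch, serving old cache if present\n";
?>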
You cannot really do this well using PHP. What you need is a Perl script that goes through the records and, using wget or curl, tries to download each file with a timeout of a few seconds. If it takes more than 10 seconds to get the file, or it returns an error, then the link is not valid. PHP does not seem like the right choice for this, in my opinion.
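For what it's worth, the same timeout-and-error logic can also be expressed with PHP's cURL extension, which wraps the same libcurl that the curl command uses. A sketch only; the 10-second figure and the checkWithTimeout() name are just illustrations of the approach described above, not a recommendation over the Perl/wget route:

<?php
// Timeout-based link check: true only if the page downloads in time without an HTTP error.
function checkWithTimeout($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // capture the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);      // give up quickly on slow connects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return $body !== false && $httpCode < 400;
}

var_dump(checkWithTimeout('http://example.com/'));
?>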