Hi people, I am looking for a bit of open-source PHP that will act as a spider for a project I am working on. Basically, I want to be able to put a URL into my system and have it go out, spider the site, and collect all the internal URLs (i.e. page links) for the site. I was going to just parse sitemaps, but there are several different types, and then you have to identify whether the site uses a multi-page sitemap, etc., so it appears easier to go down the route of a spider. Although I know it will be more processing- and bandwidth-intensive, I believe it would be the best option to ensure that I get everything, with a lot less coding required.
Spidering the pages uses exactly the same concepts as spidering the sitemaps. You're making things harder than they need to be: the sitemap syntax is much simpler than HTML and will be easier to parse in the long run.
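To give a rough idea of how little parsing a sitemap needs, here is a minimal sketch using PHP's built-in DOM extension. The function name `extractSitemapLocs()` and the example.com sample data are made up for illustration. It assumes the sitemap follows the sitemaps.org format, where a `<urlset>` lists pages and a `<sitemapindex>` lists further sitemaps, each entry holding one `<loc>`.

```php
<?php
// Sketch only: extractSitemapLocs() is a hypothetical helper, not a library function.
// It returns the root element name and every <loc> value found in a sitemap document.
function extractSitemapLocs(string $xml): array
{
    if ($xml === '') {
        return ['root' => null, 'locs' => []];
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return ['root' => null, 'locs' => []];
    }

    $locs = [];
    foreach ($dom->getElementsByTagName('loc') as $node) {
        $locs[] = trim($node->textContent);
    }

    // 'urlset' means the locs are pages; 'sitemapindex' means they are
    // child sitemaps that would each need to be fetched and parsed in turn.
    return ['root' => $dom->documentElement->localName, 'locs' => $locs];
}

// Made-up sample input: a two-page sitemap for example.com.
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
XML;

$result = extractSitemapLocs($sample);
print_r($result);
```

The multi-page case mentioned in the question would presumably be handled by checking whether `root` is `sitemapindex` and, if so, fetching each returned URL and running it through the same function. Fetching, gzip handling, and sites with no sitemap at all are left out of this sketch.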