Hi guys, I whipped something up yesterday in PHP/MySQL that starts at the home page of a website, grabs all the links on the page, and adds them to a table along with a cached copy of the page. My website has about 20,000 pages that probably need to be crawled, and my cron job runs the script once every minute, so this could be a month-long wait for me. Part of my problem is that I have a lot of duplicate-content pages. Once my entire site is crawled, I'd like to write something to calculate PageRank for the individual pages. This was meant to be a small project for me, but I think my script might be useful to others as well. Links: www.nfreak.net/stats/code.txt and www.nfreak.net/stats. If people find this useful, I could also post my code for the PageRank calculation once that is finished. I restricted the URLs to pages within my site, because if the bot leaked its way out of the site it would wander off across the entire Internet. Thanks, and good luck with your websites.
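For anyone wondering what "restricting the URLs to pages within the site" can look like, here is a minimal sketch in Python rather than the PHP of the linked script. It is not the author's code. The function name `same_site_links`, its parameters, and the `www.example.com` host are made up for illustration, and it assumes the hrefs have already been pulled out of the page.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(page_url, hrefs, site_host):
    """Resolve hrefs against page_url and keep only http(s) links on site_host.

    Hypothetical helper: the name and signature are illustrative only.
    """
    kept = []
    for href in hrefs:
        # Turn relative links into absolute ones and drop any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        # Skip mailto:, javascript:, and anything on another host.
        if parts.scheme not in ("http", "https"):
            continue
        if parts.hostname != site_host:
            continue
        # Keep each URL once, in the order it first appeared.
        if absolute not in kept:
            kept.append(absolute)
    return kept
```

Calling it with a page URL, the raw href values from that page, and your own hostname returns only the links that stay on your site, so the bot never gets a chance to leave.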
I will be coding that as soon as my bot finishes crawling and indexing my whole site, which I'm thinking could take up to a month. Plus, the PageRank calculation bot will work from the crawled pages, so the PR script won't work unless you have used the script linked above to index all of your pages. If you need help getting the code to run on your site, just message me on AIM and I can help you out.
You mean my bot? It starts with the homepage, which you add to the code before executing it (pretty simple to figure out), grabs all the links on it, and adds them to the database with a crawled value of "N". The next time the script runs, it finds the first uncrawled page, grabs all of its links, changes its crawled value to "Y", and continues like that on each run until all the pages are done. Once it is finished, you will have two tables in your database. One has a list of all pages and a default PageRank value. The other has a list of the URLs linked from each page. That second table will be used by the PageRank script (coming soon) to distribute the PR between the linked-to pages.
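The crawl loop described above can be sketched roughly like this. It is a Python/SQLite stand-in, not the actual PHP/MySQL script. The table names `pages` and `links`, the column names, the default PageRank of 1.0, and the `fetch_links` callback are all assumptions made for the example. The tiny `site` dictionary stands in for real pages so the sketch runs without any network access.

```python
import sqlite3


def init_db(conn, homepage):
    """Create the two tables and seed the queue with the homepage."""
    # One row per page: crawled flag ("N"/"Y") plus a default PageRank.
    # The default of 1.0 is an assumption, not a value from the thread.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, crawled TEXT DEFAULT 'N', pagerank REAL DEFAULT 1.0)"
    )
    # One row per link found on a page.
    conn.execute("CREATE TABLE IF NOT EXISTS links (from_url TEXT, to_url TEXT)")
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (homepage,))
    conn.commit()


def crawl_next(conn, fetch_links):
    """Crawl the first uncrawled page; return its URL, or None when done.

    fetch_links is a hypothetical callback that returns the same-site
    links found on a given URL.
    """
    row = conn.execute(
        "SELECT url FROM pages WHERE crawled = 'N' ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    url = row[0]
    for target in fetch_links(url):
        # New pages enter the queue as "N"; already-known pages are left alone.
        conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (target,))
        conn.execute(
            "INSERT INTO links (from_url, to_url) VALUES (?, ?)", (url, target)
        )
    conn.execute("UPDATE pages SET crawled = 'Y' WHERE url = ?", (url,))
    conn.commit()
    return url


# Demo on a made-up three-page site instead of real HTTP requests.
site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/a"]}
conn = sqlite3.connect(":memory:")
init_db(conn, "/")
while crawl_next(conn, lambda url: site.get(url, [])) is not None:
    pass  # each pass plays the role of one cron run
```

Each call to `crawl_next` does the work of one cron run, so crawling one page per minute just means calling it once per run instead of looping.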