Working on a Googlebot

Discussion in 'Google' started by NFreak, Feb 1, 2009.

  1. #1
    Hi guys,

    I put a little something together yesterday in PHP/MySQL that starts at the home page of a website, grabs all the links on the page, and adds them to a table along with a cached copy of the page.

    My website has about 20,000 pages that probably need to be crawled, and I set my cron job to run once every minute, so this could be a month-long wait for me. Part of my problem is that I have a lot of duplicate content pages.

    Eventually, once I have my entire site crawled, I'd like to make something to calculate PageRank for the individual pages. This was meant to be a small project for me, but I think my script might be useful to others as well.

    
    www.nfreak.net/stats/code.txt
    www.nfreak.net/stats
    If people find this useful, I could also post my code for the PageRank calculation once it is finished. I restricted the URLs to pages within my site, because if the bot leaked its way out of the site it would find the entire Internet. :p
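
    As a rough illustration of that same-site restriction, here is a minimal sketch in Python (not the PHP the script itself is written in). The function name `in_site_links` and the example URLs are assumptions made for illustration and are not taken from code.txt.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def in_site_links(page_url, hrefs, site_host):
    """Resolve hrefs against page_url and keep only links on site_host."""
    kept = []
    for href in hrefs:
        # Turn relative links into absolute ones and drop any #fragment,
        # so the same page is not queued twice under different anchors.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        # Skip mailto:, javascript: and similar non-web schemes.
        if parts.scheme not in ("http", "https"):
            continue
        # Skip anything that would lead the bot off the site.
        if parts.hostname != site_host:
            continue
        kept.append(absolute)
    return kept


if __name__ == "__main__":
    links = in_site_links(
        "http://www.nfreak.net/stats",
        ["/about", "http://example.com/x", "mailto:a@b.c", "page2#top"],
        "www.nfreak.net",
    )
    print(links)
```

    Comparing the host name exactly means a link to a bare-domain or subdomain variant of the site would be treated as external; whether that is wanted depends on how the site is set up.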

    Thanks, and good luck with your websites :)
     
    NFreak, Feb 1, 2009
  2. FREE BET

    FREE BET Peon

    #2
    I would like to see the code for the PageRank calculation, if it's OK with you...
     
    FREE BET, Feb 1, 2009
  3. NFreak

    NFreak Peon

    #3
    I will be coding that as soon as my bot finishes crawling and indexing my whole site, which I think could take up to a month. Also, the PageRank calculation script will depend on the crawled pages, so it won't work unless you have used the script linked above to index all of your pages.

    If you need help getting the code to run on your site, just message me on AIM and I can help you out.
     
    NFreak, Feb 1, 2009
  4. lindamood1

    lindamood1 Active Member

    #4
    Can you give some brief guidance about Googlebot?
     
    lindamood1, Feb 1, 2009
  5. NFreak

    NFreak Peon

    #5
    You mean my bot? It starts with the homepage, which you add to the code before executing it (pretty simple to figure out), then grabs all the links on that page and adds them to the database with a crawled value of "N". The next time the script runs, it finds the first uncrawled page, grabs all of its links, changes that page's crawled value to "Y", and continues like this until every page has been crawled.
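
    The loop described above can be sketched roughly as follows. This is Python with SQLite rather than the PHP/MySQL of the actual script, and the `pages` table, its column names, and the `fetch` callback are all assumptions made for illustration. The sketch also leaves out the restriction to pages within one site.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def init_db(conn, start_url):
    # Hypothetical schema: one row per page, crawled is 'N' or 'Y'.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, crawled TEXT NOT NULL DEFAULT 'N', html TEXT)"
    )
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (start_url,))
    conn.commit()


def crawl_one(conn, fetch):
    """Crawl the first uncrawled page; return its URL, or None when done."""
    row = conn.execute(
        "SELECT url FROM pages WHERE crawled = 'N' ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    url = row[0]
    html = fetch(url)  # fetch is supplied by the caller, e.g. an HTTP GET
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        # INSERT OR IGNORE keeps already-known URLs from being queued again.
        conn.execute(
            "INSERT OR IGNORE INTO pages (url) VALUES (?)",
            (urljoin(url, href),),
        )
    # Mark the page as done and keep a cached copy of its HTML.
    conn.execute(
        "UPDATE pages SET crawled = 'Y', html = ? WHERE url = ?", (html, url)
    )
    conn.commit()
    return url
```

    Each cron run would call `crawl_one` once, so one page is processed per run. Passing `fetch` in as a parameter keeps the database logic separate from the network code and makes it easy to try the loop against a few canned pages first.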

    Once it is finished, you will have two tables in your database. One has a list of all pages and a default PageRank value. The other has a list of the URLs linked from each page. The PageRank script (coming soon) will use that second table to distribute each page's PR among the pages it links to.
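
    Since the PageRank script is not written yet, the following is only a sketch of how the textbook PageRank iteration could run over a list of (page, linked-to page) pairs like the second table. It is written in Python rather than PHP, and the function name, the damping factor of 0.85, and the handling of pages with no outgoing links are assumptions, not the author's method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return {url: rank} for an iterable of (source_url, target_url) pairs."""
    links = list(links)
    pages = sorted({url for pair in links for url in pair})
    if not pages:
        return {}
    out_links = {page: [] for page in pages}
    for source, target in links:
        out_links[source].append(target)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # default starting value
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = [("a", "b"), ("b", "a"), ("c", "a")]
    for url, value in sorted(pagerank(example).items()):
        print(url, round(value, 4))
```

    With this setup the ranks always sum to 1, so they are relative scores rather than the 0 to 10 figure the Google toolbar shows.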
     
    NFreak, Feb 1, 2009