PHP to categorize 500k websites automatically

Discussion in 'PHP' started by adfly, Feb 14, 2010.

  1. #1
    Hi everyone,

    I need to categorize over 500k web pages into one of 16 categories and am wondering about the best way to go about it.

    First, is there an open source tool out there that already does this? I can't find anything, but I don't want to reinvent the wheel.

    Here is what I have planned:

    1. Using PHP and cURL, search Google for each category (e.g. 'finance'), then take the top 50 words from the top 50 results and build a table of keyword relevancy for that category. Repeat for every category.

    2. Again using cURL, visit each of the 500k websites (this is going to take some time) and get the top 10 keywords (by density) from each one.

    3. Write a script to match the top 10 keywords from each site against the keywords for each category. This needs some logic to say that a keyword at position 1 in a category carries more weight than one at position 10.
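Steps 2 and 3 can be sketched roughly as follows. This is only an illustration of the position-weighting idea, written in Python for brevity; the same logic ports directly to PHP. The function names, the tiny stopword list, and the linear weighting scheme are all assumptions, not part of the plan above.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "of", "to", "in", "is", "for", "on", "with"}

def top_keywords(text, n=10):
    """Return the n most frequent non-stopword words, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

def score(site_keywords, category_keywords):
    """Position-weighted overlap: earlier positions weigh more on both sides."""
    # Assumed linear weights: first keyword gets len(list), last gets 1.
    cat_weight = {w: len(category_keywords) - i
                  for i, w in enumerate(category_keywords)}
    total = 0
    for i, word in enumerate(site_keywords):
        site_weight = len(site_keywords) - i
        total += site_weight * cat_weight.get(word, 0)
    return total

def categorize(site_keywords, categories):
    """Pick the highest-scoring category, or None if nothing matches."""
    best = max(categories, key=lambda name: score(site_keywords, categories[name]))
    return best if score(site_keywords, categories[best]) > 0 else None
```

A site whose top keywords share nothing with any category table comes back as None, so it can be set aside for manual review rather than forced into a category.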

    What do you think? Any massive flaws in this method? Are there better ways of going about this?

    Thanks for any input.

    Regards,

    Ian
     
    adfly, Feb 14, 2010 IP
  2. SmallPotatoes

    #2
    I can imagine you're going to have to do a few iterations, primarily to come up with a list of red herring stopwords.
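One way to shorten those iterations, sketched here in Python under the assumption that the per-category keyword tables already exist: treat any word that shows up in a large share of the categories as a red herring, since it cannot help tell categories apart. The function names and the `min_share` threshold are hypothetical.

```python
from collections import Counter

def find_stopwords(category_keywords, min_share=0.5):
    """Return words that appear in at least min_share of all categories."""
    seen_in = Counter()
    for words in category_keywords.values():
        seen_in.update(set(words))  # count each word once per category
    threshold = min_share * len(category_keywords)
    return {word for word, n in seen_in.items() if n >= threshold}

def strip_stopwords(category_keywords, stopwords):
    """Return a copy of the tables with the stopwords removed, order kept."""
    return {name: [w for w in words if w not in stopwords]
            for name, words in category_keywords.items()}
```

The resulting set still needs a manual look, because a word shared by several related categories may be meaningful rather than noise.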
     
    SmallPotatoes, Feb 14, 2010 IP
  3. krsix

    #3
    I imagine this will take a very long time, and it will probably get that IP range banned by either Google or individual web hosts unless you include an absolutely massive wait time between requests.
     
    krsix, Feb 14, 2010 IP
  4. adfly

    #4
    Yeah agreed - there is probably a better way of getting the keyword list for each category.

    Regarding crawling the websites, I guess there must be some sort of etiquette / guidelines for this so as not to get any IPs banned.
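The usual etiquette comes down to two things: honour each site's robots.txt and leave a gap between requests to the same host. A minimal sketch in Python, using the standard library's `urllib.robotparser`; the `HostThrottle` class, the 5-second default delay, and the `allowed` helper are assumptions made for illustration, and fetching the robots.txt file itself is left out.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class HostThrottle:
    """Tracks the last request time per host and reports how long to wait."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed gap in seconds between hits
        self.clock = clock
        self.last_hit = {}

    def wait_time(self, url):
        """Seconds to sleep before it is polite to request this URL."""
        last = self.last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Call right after a request has been sent."""
        self.last_hit[urlparse(url).netloc] = self.clock()

def allowed(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a crawl loop, each URL would first be checked with `allowed`, then the script would sleep for `wait_time(url)` seconds, make the request, and call `record(url)`.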

    Does nobody know of any open source software that does this already?
     
    adfly, Feb 14, 2010 IP
  5. ngcoders

    #5
    Your IP will get banned pretty soon.

    Use the Alexa API if possible and combine it with Delicious; together they will tell you what the site is about. Then have some sort of classification mechanism to categorize it.

    Note that there is no open source script for this, and the Alexa API is paid. Also, crawling sites for keywords WILL GIVE wrong results... a lot of people spam keywords.
     
    ngcoders, Feb 15, 2010 IP