10000 Websites Tagging

Discussion in 'Products & Tools' started by _Eugene_, Dec 3, 2009.

  1. #1
    hello there,

i guess it's time to ask.. i've been looking for any acceptable solution for 4 days.. tried semantic text analyzers, libraries, web-based services (paid and free), but still got no results..

we need to categorize and tag a list of websites..
    let's say 10,000 items (sites). we have the text from each one, and it has to be parsed in some way to get a set of words. not just a single category like "sport", "travel" and so on, but a mix of categories and tags based on each website's content. if a word or phrase like "coffee maker" appears 5 times on a page, that matters more to me than just filing the site under "electronics"..
    if it's possible to return 2 or 3 categories plus a few tags, that's ok, but if not, i want to get just the tags.
    so the tags are unpredictable and may be based only on the content of the website.. and they should be in the same language as the website itself.
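    to make it concrete, here's roughly the kind of "get_tags" behavior i mean. this is only a rough sketch i put together (function name, thresholds and the sample text are mine, and it only works for space-separated languages):

    ```python
    import re
    from collections import Counter

    def get_tags(text, max_tags=5, min_count=2):
        """Crude frequency-based tag extraction: count single words and
        two-word phrases, keep the most frequent ones. Hypothetical
        helper, not from any real API."""
        # lowercase and pull out letter runs (whitespace/punctuation split,
        # so this only works for languages that use spaces)
        words = re.findall(r"[^\W\d_]+", text.lower(), re.UNICODE)
        # drop very short tokens, which are mostly stop words
        words = [w for w in words if len(w) > 3]
        # two-word phrases, so "coffee maker" can outrank single words
        phrases = [" ".join(p) for p in zip(words, words[1:])]
        counts = Counter(words + phrases)
        return [t for t, n in counts.most_common(max_tags) if n >= min_count]

    page = ("Best coffee maker reviews. A coffee maker should brew fast. "
            "Compare coffee maker prices and coffee grinder deals.")
    print(get_tags(page))  # → ['coffee', 'maker', 'coffee maker']
    ```

    a real solution would need stop-word lists per language and smarter phrase detection, which is exactly why i'd rather find an existing service.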

i'm not mad enough to write the scripts myself.. that could take a lifetime.. i'm looking for some solution like http://www.alchemyapi.com/ but less buggy.. and able to analyze non-english texts..
    some languages don't have spaces between words, so simply parsing the text for the most frequent words isn't the best idea either..
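    for those no-space scripts, the only cheap workaround i can think of is counting overlapping character pairs instead of words. a rough sketch (real segmentation needs a dictionary-based tool, e.g. MeCab for japanese, so treat this as a fallback only):

    ```python
    import re
    from collections import Counter

    # rough regex for Japanese kana plus CJK ideographs; ranges are
    # my own simplification, not a complete script definition
    CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")

    def cjk_bigrams(text):
        """Count overlapping 2-character sequences inside runs of
        CJK text, as a stand-in for word frequency when the script
        has no spaces."""
        grams = Counter()
        for run in CJK.findall(text):
            for i in range(len(run) - 1):
                grams[run[i:i + 2]] += 1
        return grams

    # "coffee maker and coffee beans" in Japanese katakana/kanji
    print(cjk_bigrams("コーヒーメーカーとコーヒー豆").most_common(3))
    ```

    repeated pairs like "コー" float to the top the same way repeated words would, which is about the best you can do without a per-language dictionary.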

the main trouble is that these websites are not in just 1 or 2 languages.. probably every existing language is represented.. arabic, russian, spanish, czech, german etc etc.. i can detect the language, and at the final step i need some "get_tags" function to extract the tags.
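    the language-detection step itself can at least be approximated with stdlib tools by looking at which unicode script dominates the text. very crude sketch (it only tells scripts apart, not languages, so russian and bulgarian both come back "CYRILLIC".. a real detector would still be needed on top):

    ```python
    import unicodedata
    from collections import Counter

    def guess_script(text):
        """Tally the script prefix of each letter's Unicode name
        (e.g. 'CYRILLIC SMALL LETTER KA' -> 'CYRILLIC') and return
        the most common one. A script hint, not real language ID."""
        scripts = Counter()
        for ch in text:
            if ch.isalpha():
                name = unicodedata.name(ch, "UNKNOWN")
                scripts[name.split()[0]] += 1
        return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

    print(guess_script("кофеварка"))     # → CYRILLIC
    print(guess_script("coffee maker"))  # → LATIN
    ```

    good enough to route a page to the right per-language tagger, which is the step i'm stuck on.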

google disabled their "site flavored search" a while ago.. maybe some other companies provide an api for text analysis? some good solution based on many dictionaries and language-specific features..

i've seen some services that have their databases categorized even for japanese and arabic websites, and this makes me sure the truth is somewhere near ;)
     
    _Eugene_, Dec 3, 2009 IP