Bayes' Theorem - Content Categorizing

Discussion in 'PHP' started by jijihuqw, Jun 15, 2011.

  1. #1
    I have built a script that categorizes content against a training set using Bayes' theorem, n-grams, etc. I'm having trouble comparing the n-grams when categorizing content, because symbols stored in the database get replaced by their HTML entities (the text is UTF-8), so the same character turns up in two forms, e.g. £ and &pound;.

    I'm wondering whether anyone has built a similar script using Bayes' theorem and has any ideas on the best way around this. Should I consider removing all non-alphanumeric characters and just forget about them?

    Any help would be appreciated.
     
    jijihuqw, Jun 15, 2011 IP
  2. BRUm

    #2
    You could create a lookup table of the translated symbols, but that would be reinventing the wheel. PHP has core functions for encoding and decoding HTML entities, and since core functions are written in C they will be fast.

    htmlentities()
    and html_entity_decode()
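
    As a rough sketch of how that could fit the n-gram comparison: decode entities in both the training text and the incoming text before building n-grams, so £ and &pound; end up as the same UTF-8 string. The helper names entities_to_utf8() and word_ngrams() are made up for illustration, and splitting on whitespace into word n-grams is an assumption about how the original script tokenizes.

```php
<?php
// Hypothetical helpers; not taken from the original poster's script.

/**
 * Decode HTML entities so "&pound;" and "£" become the same UTF-8 string.
 */
function entities_to_utf8(string $text): string
{
    // ENT_QUOTES also decodes single- and double-quote entities.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

/**
 * Split text on whitespace and return its word n-grams.
 */
function word_ngrams(string $text, int $n = 2): array
{
    // The /u modifier makes the pattern treat the input as UTF-8.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $ngrams = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $ngrams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $ngrams;
}

// Both spellings of the pound sign yield identical n-grams after decoding.
$fromDb   = word_ngrams(entities_to_utf8('save &pound;5 today'));
$fromUser = word_ngrams(entities_to_utf8('save £5 today'));
var_dump($fromDb === $fromUser);
```

    Decoding once at the point where text enters the classifier keeps symbols such as currency signs available as features, which stripping all non-alphanumeric characters would throw away.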
     
    BRUm, Jun 15, 2011 IP