I have built a script that categorizes content based on a training set, using Bayes' theorem, n-grams, etc. I'm having issues when comparing the n-grams to categorize the content, because symbols stored in the database get replaced by their HTML entities (the text is UTF-8), e.g. £ is stored as &pound;, so the two forms never match. I'm wondering if anyone has built a similar script using Bayes' theorem and has any ideas on the best way to get around this. Should I consider removing all non-alphanumeric characters and just forget about them? Any help would be appreciated.
You could create a lookup table of the translated symbols, but that is reinventing the wheel. PHP has core functions for encoding and decoding HTML entities, and since core functions are written in C they'll be fast: htmlentities() and html_entity_decode(). Decode both the stored text and the incoming text to the same form before you build the n-grams, and the symbols will compare equal.
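Here is a rough sketch of how that could look. The helper names normalize_text() and char_ngrams() are made up for illustration, and I'm assuming character n-grams, UTF-8 text and the mbstring extension being available; adjust it to however your script actually builds its n-grams.

```php
<?php
// Hypothetical helper: decode HTML entities so "&pound;" and "£" end up identical.
function normalize_text(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical helper: character n-grams over a UTF-8 string (needs mbstring).
function char_ngrams(string $text, int $n): array
{
    $ngrams = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i + $n <= $length; $i++) {
        $ngrams[] = mb_substr($text, $i, $n, 'UTF-8');
    }
    return $ngrams;
}

// One string as it might come out of the database, one as typed by a user.
$fromDatabase = normalize_text('&pound;10 off');
$fromInput    = normalize_text("\u{00A3}10 off"); // "\u{00A3}" is the pound sign

var_dump($fromDatabase === $fromInput);
var_dump(char_ngrams($fromDatabase, 3) === char_ngrams($fromInput, 3));
```

Run the same normalization over the training set and over the content being classified, otherwise the two sides will still disagree. Stripping all non-alphanumeric characters would also work, but you would lose symbols like £ that may well be useful features for categorizing.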