Bayesian Theorem - Content Categorizing

Discussion in 'PHP' started by jijihuqw, Jun 15, 2011.

jijihuqw Peon

#1

I have built a script that categorizes content against a training set using Bayes' theorem, n-grams, etc. I'm having issues when comparing the n-grams to categorize the content, because the symbols in the database get replaced by their entities (UTF-8), e.g. Â£ and £, so n-grams that should match do not.

I'm just wondering if anyone has built a similar type of script using Bayes' theorem and has any ideas on the best way to get around this? Should I consider removing all non-alphanumeric characters and just forget about them?

Any help would be appreciated.

jijihuqw, Jun 15, 2011
BRUm Well-Known Member

#2

You could create a lookup table of the translated symbols, although that would be reinventing the wheel. PHP has core functions for encoding and decoding HTML entities, and since core functions are written in C, they'll be fast.

htmlentities() and html_entity_decode()
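As a rough sketch of how that could look in practice: decode entities in both the training text and the incoming text so they share one UTF-8 form, then build the n-grams from the result. The helper names `normalize_text()` and `char_ngrams()` are made up for illustration, character-level n-grams are an assumption (the original script may use word n-grams), and the `mb_*` calls assume the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: bring stored and incoming text to one canonical
// UTF-8 form before any n-grams are extracted.
function normalize_text(string $text): string
{
    // Turn entities such as &pound; or &#163; back into real characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Lowercase with multibyte awareness so case differences do not
    // produce distinct n-grams.
    return mb_strtolower($decoded, 'UTF-8');
}

// Hypothetical helper: character n-grams, counted in characters rather
// than bytes, so a multibyte symbol like £ is never split in half.
function char_ngrams(string $text, int $n): array
{
    $grams = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i + $n <= $length; $i++) {
        $grams[] = mb_substr($text, $i, $n, 'UTF-8');
    }
    return $grams;
}

// Usage: both spellings of the pound sign yield the same n-grams.
$fromDatabase = char_ngrams(normalize_text('&pound;10 OFF'), 2);
$fromInput    = char_ngrams(normalize_text('£10 off'), 2);
var_dump($fromDatabase === $fromInput);
```

Note that a literal Â£ showing up in place of £ often points to a character set mismatch (UTF-8 bytes being read as Latin-1) rather than to entities. If that is the case here, checking that the database connection and the page both use UTF-8 is probably the better place to start, and decoding entities alone would not fix it.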

BRUm, Jun 15, 2011
