I'm looking for literature on how to go about this, but I'm a bit lost on what to even search for. It seems pointless to reinvent the wheel, and I'm sure literature already exists on how a program like this should be structured. I am assuming Bayesian probability would be used to compare the likelihood of a word [or phrase] occurring on any page on the internet against the number of times it appears on a specific page. I understand that. But to do that, wouldn't you need to analyze every word [and every combination of words] found on the page? That doesn't seem practical, since the number of word combinations grows very quickly with the length of the page. I'm not asking for anyone to lay out a solution, but if you could point me in the right direction on where to find answers to a problem like this, it would be much appreciated!
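For what it's worth, here is a rough sketch of the comparison I have in mind, for single words only. It is not a real Bayesian model, just a log ratio of how often a word appears on the page versus how often I would expect it in general. The `BACKGROUND` table, its numbers, `DEFAULT_PROB`, and the function names are all made up for illustration; a real version would need background frequencies estimated from a large corpus.

```python
import math
import re
from collections import Counter

# Hypothetical background probabilities: how likely each word is on an
# arbitrary page. These numbers are invented purely for illustration.
BACKGROUND = {
    "the": 0.05,
    "and": 0.03,
    "of": 0.03,
    "python": 0.0001,
    "parser": 0.00005,
}
# Assumed probability for any word missing from the table above.
DEFAULT_PROB = 0.00001


def score_words(text, background=BACKGROUND, default=DEFAULT_PROB):
    """Score each word by log(page frequency / background frequency)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, n in counts.items():
        page_prob = n / total
        scores[word] = math.log(page_prob / background.get(word, default))
    return scores


def top_keywords(text, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = score_words(text)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


page = "the parser and the python parser of the python parser"
print(top_keywords(page, 2))
```

Even this toy version has to touch every word on the page once. That seems manageable for single words, but I don't see how it would extend to every combination of words, which is the part I'm hoping the literature addresses.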