Searching for an Algorithm to Compare Multiple Texts

Discussion in 'PHP' started by nitsanbn, Dec 4, 2009.

  1. #1
    Hi,

    I have a client who wants to be able to compare documents.
    He wants to load a batch of papers (HTML files) and check whether they contain similar sentences or phrases (more than 2 words).

    I have successfully converted the HTML files into plain text, and I want to store the texts in the database so that I can index them and compare them in the future.

    I am not sure how my MySQL tables should look, but I know that the text is UTF-8 and that I want to compare the texts with each other.

    Some links and information would be highly appreciated!

    I'm mostly looking for an algorithm rather than a piece of code!

    Thank you for your time!!!

    :)
     
    nitsanbn, Dec 4, 2009
  2. #2
    Save each text in the MySQL database as a LONGTEXT field. To load two documents for comparison, use two includes, each with a GET parameter on the end, or use cURL, and fetch the saved text from the LONGTEXT field by the ID of its row.

    Then use array_count_values to count the words, and use str_replace to remove words of 2 letters or fewer. For each word the array finds, calculate a % value based on the length of the word and the length of the remaining trimmed-down document. Save the % and the word in a separate table.

    Then make another two pages that load the entire documents without trimming them down. Highlight the words in each document using a while loop over the saved words, ordered by descending ID. Use different colors based on the severity of the match, e.g. red for bad, orange for moderate, etc.
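The counting and weighting steps above could be sketched roughly as follows. This is only an illustration in Python, not PHP, and it rests on some assumptions: the weight formula (characters covered by a word divided by the characters left after trimming) is just one reading of "a % value based on the length of the word and the length of the remaining document", and the function names and the color thresholds are made up for the example. Because the original question asks about phrases of more than 2 words, the sketch also includes a shared_phrases helper that compares 3-word sequences. That part is an addition and not part of the approach described above.

```python
import re
from collections import Counter


def words(text, min_len=3):
    """Lowercase the text, split it into words, and drop words shorter than min_len."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_len]


def word_weights(text):
    """Return {word: percent}, where percent is the share of the trimmed
    text's characters covered by all occurrences of that word.
    (Assumed formula; the description above does not pin it down.)"""
    kept = words(text)
    total_chars = sum(len(w) for w in kept)
    if total_chars == 0:
        return {}
    counts = Counter(kept)
    return {w: 100.0 * len(w) * n / total_chars for w, n in counts.items()}


def shared_phrases(text_a, text_b, n=3):
    """Return the set of n-word sequences that occur in both texts.
    Short words are kept here so that phrases stay intact."""
    def ngrams(text):
        ws = words(text, min_len=1)
        return {" ".join(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    return ngrams(text_a) & ngrams(text_b)


def severity_color(weight, high=10.0, moderate=5.0):
    """Map a weight to a highlight color. The thresholds are hypothetical."""
    if weight >= high:
        return "red"
    if weight >= moderate:
        return "orange"
    return "yellow"
```

For example, sorted(shared_phrases("the quick brown fox jumps over the lazy dog", "a quick brown fox sat on the lazy dog")) returns ['quick brown fox', 'the lazy dog']. The weights from word_weights are what would be saved, one row per word, in the separate table.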

    Sorry if I babbled on, but this is the best I can do for you. It is quite a task.

    Good Luck
    -AustinQPT
     
    AustinQPT, Dec 5, 2009