Hello, I'd like to ask for help with the following problem. I have 3 .txt files - A, B and C - with 10-15k words in each. I'd like to know how many words (and which ones) are shared between A-B, A-C, B-C, and across all three files A-B-C. Probably in Excel? How? Thanks for any help.
I would write a program in PHP to split each file into words, then add each word to a database. Then group the database rows by word and include the totals in another column. You would end up with a WORD | Number of Occurrences table. What do you think?
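A rough sketch of that database idea, written in Python with its built-in sqlite3 module rather than PHP. The table name `words`, the column names and the extra `files` column (how many different files a word appears in) are my own assumptions, not part of the suggestion above. Any word with files >= 2 is shared between at least two files.

```python
import re
import sqlite3

def count_words(texts):
    """Return (word, occurrences, files) rows for a dict of {file_label: text}."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Hypothetical schema: one row per word occurrence.
    conn.execute("CREATE TABLE words (source TEXT, word TEXT)")
    for label, text in texts.items():
        # Lowercase, then pull out runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        conn.executemany(
            "INSERT INTO words (source, word) VALUES (?, ?)",
            [(label, token) for token in tokens],
        )
    rows = conn.execute(
        "SELECT word, COUNT(*) AS occurrences, COUNT(DISTINCT source) AS files "
        "FROM words GROUP BY word ORDER BY occurrences DESC, word"
    ).fetchall()
    conn.close()
    return rows

# Tiny stand-ins for the contents of files A, B and C.
sample = {"A": "the cat sat", "B": "the dog sat down", "C": "the bird"}
for word, occurrences, files in count_words(sample):
    print(word, occurrences, files)
```

For real files you would read each one with open(path).read() and pass the text in the same way.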
In PHP, preg_replace all runs of whitespace characters with a single space, then explode the result into an array. You can then use PHP's array_intersect function: http://php.net/manual/en/function.array-intersect.php to compare those arrays and get back a list of words common to those files. (array_diff would give you the opposite: the words in the first file that are missing from the others.) If I have time later I'll try to remember to revisit this and toss together a quick demo of that.
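For comparison, the same split-then-compare idea can be sketched in Python with sets. The function names here are made up for illustration. It assumes words are separated only by whitespace, so punctuation stays attached to a word ("cat," and "cat" would count as different) unless you strip it first.

```python
def word_set(text):
    """Turn a text into a set of lowercase words."""
    # str.split() with no arguments splits on any run of whitespace,
    # so no separate whitespace-collapsing step is needed.
    return set(text.lower().split())

def common_words(*texts):
    """Words that occur in every one of the given texts."""
    sets = [word_set(t) for t in texts]
    return set.intersection(*sets)

# Tiny stand-ins for the contents of files A, B and C.
a = "red green blue"
b = "green blue yellow"
c = "blue yellow red"

print(sorted(common_words(a, b)))     # ['blue', 'green']
print(sorted(common_words(a, c)))     # ['blue', 'red']
print(sorted(common_words(b, c)))     # ['blue', 'yellow']
print(sorted(common_words(a, b, c)))  # ['blue']
```

len(common_words(a, b)) gives the "how many" part of the question.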
I do something like what you want in Excel. First I would convert each text file to a one-word-per-line file, so I could drop each file into its own Excel column with one word per cell. Then you just compare Column A to Column B. I put this formula in Column C:

=IF(COUNTIF($A:$A, $B3)<>0, "In Column A", "Not In Column A")

Basically it tells me, for each entry in Column B, whether that same value exists anywhere in Column A. You can then sort by those results so you have just the words that are in both Columns A and B, etc.

p.s. This is probably easier done in PHP, but I am old school and like to figure shit out with the tools I know.
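The one-word-per-line conversion is the only step above that Excel won't do for you. A minimal Python sketch of it follows; the function names and file paths are hypothetical. It lowercases everything, which is harmless here because COUNTIF ignores case anyway.

```python
import re

def one_word_per_line(text):
    """Return the text reformatted as one lowercase word per line."""
    words = re.findall(r"\w+", text.lower())  # drops punctuation and whitespace
    return "\n".join(words)

def convert_file(src_path, dst_path):
    """Write a one-word-per-line copy of src_path to dst_path (paths are up to you)."""
    with open(src_path, encoding="utf-8") as src:
        converted = one_word_per_line(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted + "\n")

print(one_word_per_line("Hello, world! Hello again."))
```

Open the converted file in any editor, copy everything, and paste it into the top cell of a column; each line lands in its own cell.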
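Here we go, actual working tested code.

<?php
// Read a file and return its unique words, lowercased.
function load_words($path) {
	$text = strtolower(file_get_contents($path));
	// PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing punctuation would produce.
	return array_unique(preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$file1 = load_words('content1.txt');
$file2 = load_words('content2.txt');
$file3 = load_words('content3.txt');

// array_intersect() keeps the words present in every array given;
// array_diff() would return the words that are NOT shared.
// print_r(..., true) returns the dump as a string instead of echoing it out of order.
echo '
<h1>Like words in files demo</h1>
<h2>Words common to content1.txt and content2.txt</h2>
<pre>', print_r(array_intersect($file1, $file2), true), '</pre>
<h2>Words common to content2.txt and content3.txt</h2>
<pre>', print_r(array_intersect($file2, $file3), true), '</pre>
<h2>Words common to content1.txt and content3.txt</h2>
<pre>', print_r(array_intersect($file1, $file3), true), '</pre>
<h2>Words common to all three files</h2>
<pre>', print_r(array_intersect($file1, $file2, $file3), true), '</pre>';
?>

Funny thing is, I completely forgot about preg_split... which saves a step.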
It's a few lines in Python: just create a dict mapping every word to the number of times it appears, and process each word in each file.
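Something along these lines, as a sketch. The function names and the sample texts are placeholders of mine, and "word" here means any run of letters, digits or underscores, lowercased.

```python
import re
from collections import Counter

def word_counts(text):
    """Map each lowercase word in the text to its number of appearances."""
    return Counter(re.findall(r"\w+", text.lower()))

def shared_report(counts_by_file):
    """Map each word found in two or more files to its per-file counts."""
    report = {}
    for label, counts in counts_by_file.items():
        for word, n in counts.items():
            report.setdefault(word, {})[label] = n
    # Keep only the words that showed up in more than one file.
    return {w: per_file for w, per_file in report.items() if len(per_file) > 1}

# Tiny stand-ins for the contents of files A, B and C.
counts = {
    "A": word_counts("apple pear apple"),
    "B": word_counts("pear plum"),
    "C": word_counts("apple pear fig"),
}
for word, per_file in sorted(shared_report(counts).items()):
    print(word, per_file)
# apple {'A': 2, 'C': 1}
# pear {'A': 1, 'B': 1, 'C': 1}
```

The keys of each inner dict tell you which files share the word, so A-B, A-C, B-C and A-B-C can all be read off the same report.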