Searching for the same words in 3 .txt files

arandon Active Member

#1

Hello,
I'd like to ask for help with the following problem. I have 3 .txt files (A, B and C) with 10-15k words in each. I'd like to know how many words (and which ones) are shared between A-B, A-C, B-C and across all three files, A-B-C.
Probably in Excel? How?
Thanks for any help.

arandon, Mar 16, 2013

projectWORD Active Member Best Answer

#2

I would write a program in PHP to split the text into words, then add each word to a database. Then group the database by word and include the totals in another column. You would end up with a table of WORD | Number of Occurrences. What do you think?
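The WORD | Number of Occurrences idea can also be tried without a database. Here is a minimal sketch in Python (the language suggested later in the thread), assuming that a "word" is any run of letters, digits or underscores and that case should be ignored. The function name `word_counts` is made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Assumption: a word is a run of \w characters; "don't" becomes "don" and "t".
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

# Print a small WORD | count table, most frequent words first.
counts = word_counts("the cat and the hat")
for word, n in counts.most_common():
    print(f"{word} | {n}")
```

For real files, the text would come from something like `open("A.txt").read()` instead of the literal string.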

projectWORD, Mar 25, 2013

Feriscool Greenhorn

#3

Java.

Input your file. Search through it with loops to find your specific word. Output the results.

Feriscool, Mar 25, 2013

arandon Active Member

#4

Thanks to you both for the help. Sounds very logical; I will try it soon.

arandon, Mar 26, 2013

deathshadow Acclaimed Member

#5

In PHP, preg_replace all runs of whitespace characters with a single space, then explode the result into an array. You can then use PHP's array_intersect function:

http://php.net/manual/en/function.array-intersect.php

to compare those arrays and get back a list of the words common to those files. (array_diff does the opposite: it returns the words that are in the first array but not in the others.) If I have time later I'll try to revisit this and toss together a quick demo.

deathshadow, Mar 28, 2013

browntwn Illustrious Member

#6

I do something like what you want in Excel. First I would convert each text file to a one-word-per-line file so I could drop each one into its own Excel column, with one word per cell. Then you just compare Column A to Column B. I put this formula in Column C. Basically it tells me, for each entry in Column B, whether that same value exists anywhere in Column A. You can then sort by those results so you have just the words that are in both Columns A and B, etc.

=IF(COUNTIF($A:$A, $B3)<>0, "In Column A", "Not In Column A")

p.s. This is probably easier done in PHP, but I am old school and like to figure shit out with the tools I know.
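The conversion to one word per line has to happen before anything goes into Excel. Here is a rough sketch of that step in Python, assuming a word is any run of letters, digits or underscores. The function name `one_word_per_line` and the file names in the comments are hypothetical.

```python
import re

def one_word_per_line(text):
    """Return the words of `text` joined with newlines, one word per line."""
    # Assumption: punctuation and whitespace both count as separators.
    words = re.findall(r"\w+", text)
    return "\n".join(words)

print(one_word_per_line("Hello, world!\nfoo  bar"))

# Hypothetical usage with real files (names are placeholders):
#   with open("A.txt") as src, open("A_words.txt", "w") as dst:
#       dst.write(one_word_per_line(src.read()))
```

The resulting file can be pasted into a single column, so that the COUNTIF formula sees one word per cell.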

browntwn, Mar 28, 2013

deathshadow Acclaimed Member

#7

Here we go, actual working tested code.

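<?php

// Split a file into its unique, lowercased words. PREG_SPLIT_NO_EMPTY drops
// the empty strings preg_split would otherwise return at the edges.
function fileWords($filename) {
	$text = strtolower(file_get_contents($filename));
	return array_unique(preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$file1 = fileWords('content1.txt');
$file2 = fileWords('content2.txt');
$file3 = fileWords('content3.txt');

// array_intersect() returns the values of the first array that are present in
// all the other arrays (array_diff() would return the words that are NOT shared).
// print_r(..., true) returns the dump as a string instead of printing it early.
echo '
	<h1>Like words in files demo</h1>
	
	<h2>Words common to content1.txt and content2.txt</h2>
	<pre>',print_r(array_intersect($file1,$file2),true),'</pre>
	
	<h2>Words common to content2.txt and content3.txt</h2>
	<pre>',print_r(array_intersect($file2,$file3),true),'</pre>
	
	<h2>Words common to content1.txt and content3.txt</h2>
	<pre>',print_r(array_intersect($file1,$file3),true),'</pre>
	
	<h2>Words common to all three files</h2>
	<pre>',print_r(array_intersect($file1,$file2,$file3),true),'</pre>';
?>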

The funny part is that I completely forgot about preg_split, which saves a step.
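For anyone who would rather not set up PHP, the same comparison can be sketched with Python sets. It covers the four results the original question asks for (A-B, A-C, B-C and A-B-C). This is only an illustration under stated assumptions: words are runs of letters, digits or underscores, case is ignored, and the names `unique_words` and `common_words` are invented here.

```python
import re

def unique_words(text):
    """Return the set of distinct lowercased words in `text`."""
    return set(re.findall(r"\w+", text.lower()))

def common_words(a, b, c):
    """Return the shared words for each pair of texts and for all three."""
    wa, wb, wc = unique_words(a), unique_words(b), unique_words(c)
    return {
        "A-B": wa & wb,          # set intersection: words found in both
        "A-C": wa & wc,
        "B-C": wb & wc,
        "A-B-C": wa & wb & wc,
    }

# Small inline texts stand in for the contents of the three files.
result = common_words("red green blue", "green blue yellow", "blue yellow red")
for label, words in result.items():
    print(label, len(words), sorted(words))
```

With real files, each argument would be the file's contents, for example `open("content1.txt").read()`. At 10-15k words per file, set intersection is effectively instant.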

deathshadow, Mar 28, 2013

matessim Active Member

#8

A few lines in Python: just create a dict mapping every word to the number of appearances it has, and process each file word by word.
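One possible reading of that, as a sketch rather than a finished tool: count in how many of the files each word appears, so that a count of 3 means the word is in all three. It assumes a word is a run of letters, digits or underscores and that case is ignored. The name `files_per_word` is made up for this example.

```python
import re

def files_per_word(texts):
    """Map each word to the number of texts it appears in at least once."""
    seen_in = {}
    for text in texts:
        # set() ensures a word is counted once per text, however often it repeats.
        for word in set(re.findall(r"\w+", text.lower())):
            seen_in[word] = seen_in.get(word, 0) + 1
    return seen_in

# Small inline texts stand in for the contents of files A, B and C.
counts = files_per_word(["red green blue", "green blue yellow", "blue yellow red"])
in_all = sorted(w for w, n in counts.items() if n == 3)
print(in_all)
```

A count of 2 only says that a word is shared by some pair of files, not which pair. Mapping each word to a set of file names instead of a number would keep that detail.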

matessim, Mar 30, 2013
