Grr.. Need help with a preg_match problem

Discussion in 'PHP' started by lggmaster, Dec 29, 2007.

  1. #1
    Maybe I'm going about this completely wrong, but I have tried preg_match, preg_match_all, etc.

    I'm looking to take a block of text, strip out words like 'and', 'the', etc., and then loop over the remaining words so they can be placed into 3 different tables.

    Think of a basic keyword density tool, if that helps explain what I'm looking for.
     
    lggmaster, Dec 29, 2007 IP
  2. Barti1987

    #2
    Use str_replace to remove the words you want. Then just explode the remaining text and loop through it.
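
    A rough sketch of the split-and-loop idea, with a hypothetical sample `$text` and `$stopwords` list that are not from this thread. Note one swap: a plain `str_replace('the', '', $text)` matches substrings, so it would also turn "other" into "or". This sketch therefore splits first and filters whole words with `array_diff` instead of calling `str_replace`.

    ```php
    <?php
    // Hypothetical sample input and stopword list, for illustration only.
    $text      = 'The cat and the other cat';
    $stopwords = array('and', 'the');

    // Lowercase first so "The" and "the" are treated alike, then split on whitespace.
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);

    // Drop whole-word stopwords only; "other" survives because it is compared
    // as a complete word rather than searched for the substring "the".
    $kept = array_values(array_diff($tokens, $stopwords));

    // Loop through whatever is left.
    foreach ($kept as $word) {
        echo $word, "\n";
    }
    ```

    With the sample data, the loop visits "cat", "other", "cat" in that order.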

    Peace,
     
    Barti1987, Dec 30, 2007 IP
  3. joebert

    #3
    <?php
    	// Get article
    	$text = file_get_contents('test.txt');
    
    	// Replace non-word characters with whitespace. You can thank "w00t" for the \d in the pattern...
    	$text = preg_replace('#[^a-z\d\s]+#i', ' ', $text);
    
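    	// Collapse every run of whitespace (spaces, tabs, newlines) into a single space,
    	// so that explode(' ') below also splits words separated by a lone newline or tab
    	$text = preg_replace('#\s+#', ' ', $text);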
    
    	// Reserve a place for words
    	$words = array();
    
    	// Split article into words
    	$text = explode(' ', trim($text));
    
    	// Turn $words into an associative array with words as the keys & their counts as the values
    	foreach($text as &$word)
    	{
    		// Make sure "The" and "the" are counted the same
    		$word = strtolower($word);
    		// If this word already has an entry, just increment its counter, otherwise register the word
    		isset($words[$word]) ? $words[$word]++ : ($words[$word] = 1);
    	}
    	// Don't need $text anymore
    	unset($text);
    
    	// Get the list of stopwords "the", "an", "and", etc. Each stopword is on its own line.
    	$stopwords = file('stopwords.txt');
    
    	// Loop through the $stopwords; if there's an entry for a $stopword in $words, get rid of that entry.
    	foreach($stopwords as &$word)
    	{
    		// Trim the fat
    		$word = trim($word);
    		// Found & removed
    		if(isset($words[$word]))
    		{
    			unset($words[$word]);
    		}
    	}
    	
    	// Don't need these anymore
    	unset($stopwords);
    	
    	// Sort the array with highest count first,
    	// use "arsort" so the word keys aren't replaced with numeric keys, which would defeat the entire purpose.
    	arsort($words);
    	echo '<pre>', print_r($words, true), '</pre>';
    ?>
     
    joebert, Dec 30, 2007 IP
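
    For comparison, the counting steps in the script above can also be sketched with PHP's built-in array functions. The function name `keyword_counts` and the inline sample data are hypothetical, and this is one possible arrangement rather than a replacement for the posted script. It also prints a density percentage in the spirit of the keyword density tool from the first post, assuming density means a word's count divided by the number of words left after stopword removal (other definitions exist).

    ```php
    <?php
    // Hypothetical helper: returns word => count, highest count first, stopwords removed.
    function keyword_counts($text, array $stopwords)
    {
        // Lowercase, then split on any run of characters that is not a letter or digit.
        $tokens = preg_split('/[^a-z\d]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        // Count how often each word occurs.
        $counts = array_count_values($tokens);

        // Remove stopwords by key (array_flip turns the list into word => index).
        $counts = array_diff_key($counts, array_flip($stopwords));

        // Highest count first, keeping the word keys.
        arsort($counts);
        return $counts;
    }

    // Sample input, for illustration only.
    $counts = keyword_counts('The cat sat. The cat ran, and the dog sat!', array('the', 'and'));
    $total  = array_sum($counts);

    foreach ($counts as $word => $count) {
        printf("%-5s %d  %.1f%%\n", $word, $count, 100 * $count / $total);
    }
    ```

    The stopwords here are assumed to be lowercase already, matching the lowercased text; a list read from a file would need the same `trim` step as in the script above.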