preg_replace to strip noise words

Discussion in 'PHP' started by grutland, Apr 14, 2010.

  1. #1
    Hi,

    Having some issues with stripping noise words.
    I won't give you the full regular expression I'm using but this is a shortened version which includes a few noise words:
    /\s(?:a|about|after|all|be|because|in|of|the)\s/i
    PHP:
    I'm replacing each match with a single space so that the next word can still be tested.

    But I'm getting some strange results. The text I'm testing on is "property located in the middle of nature." and I'm getting this back: "property located the middle nature."
    Anyone know why the "the" isn't being stripped?

    Also, another problem I face is what happens if the noise word is the first or last word in a string.
    There won't be a space on both sides of the word, but I need to test for a space on both sides, otherwise it will start stripping out parts of other words.

    Any ideas?
     
    grutland, Apr 14, 2010 IP
  2. Sergey Popov

    #2
    Interesting issue. It looks like, in your example, the regular expression first matches the word "in" together with the spaces on both sides. That match consumes the space before "the", so "the" no longer has a space on its left and does not match your search pattern.
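    To make the consumed-space effect visible, here is a minimal sketch that lists what the original pattern actually matches. The variable names ($pattern, $text, $found, $result) are only illustrative and not from the original posts.

```php
<?php
// The shortened pattern from post #1: a noise word with whitespace on both sides.
$pattern = '/\s(?:a|about|after|all|be|because|in|of|the)\s/i';
$text = 'property located in the middle of nature.';

// Collect every full match. Matches cannot overlap, so the space that
// ends " in " is not available as the leading space of "the".
preg_match_all($pattern, $text, $matches);
$found = $matches[0];
print_r($found); // " in " and " of " only

// Replacing each match with one space reproduces the result from post #1.
$result = preg_replace($pattern, ' ', $text);
echo $result, "\n"; // property located the middle nature.
```

    Because " in " is replaced by a fresh space, running the same replacement a second time would catch " the ", which is roughly why a list of separate patterns applied one after another gets further.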

    I suggest that you use array of search patterns instead of one string. Like this:

    
      $content='property located in the middle of nature';
    
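      // One pattern per noise word; preg_replace applies them in order,
      // each one to the result of the previous replacement.
      $search = array(
        "/\s(a)\s/i",
        "/\s(about)\s/i",
        "/\s(after)\s/i",
        "/\s(all)\s/i",
        "/\s(be)\s/i",
        "/\s(because)\s/i",
        "/\s(in)\s/i",
        "/\s(of)\s/i",
        "/\s(the)\s/i"
      );
    
      // Replace every match with a single space.
      echo preg_replace($search," ",$content);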
    
    
    PHP:
    I tried it and it worked fine, producing this:

    
    property located middle nature
    
    Code (markup):
     
    Sergey Popov, Apr 14, 2010 IP
  3. Kaimi

    #3
    Try this:
    
    /(?<=\s)(?:a|about|after|all|be|because|in|of|the)\s/i
    
    PHP:
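    As a rough illustration of why the lookbehind version behaves differently: (?<=\s) only checks that the previous character is whitespace and does not make it part of the match, so that space is still there for the next word to be checked against. The sketch below assumes the match is replaced with an empty string (the replacement is not shown in the post), and the variable names are illustrative.

```php
<?php
$text = 'property located in the middle of nature.';

// (?<=\s) asserts a whitespace character before the word without consuming it;
// only the word and the whitespace after it are part of the match.
$pattern = '/(?<=\s)(?:a|about|after|all|be|because|in|of|the)\s/i';

// Assumption: replace with '' because the leading space is kept by the lookbehind.
$clean = preg_replace($pattern, '', $text);
echo $clean, "\n"; // property located middle nature.
```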
     
    Kaimi, Apr 14, 2010 IP
  4. grutland

    #4
    Hi Kaimi, that works fine.
    Don't suppose you'd like to explain why that works?
    Also, what happens if the noise word is at the end of a sentence?

    Just for the sake of an example, if you had this sentence: "property located in the middle of nature the"
    As well as a full stop "."
     
    grutland, Apr 14, 2010 IP
  5. Kaimi

    #5
    Read "Positive and Negative Lookbehind" at http://www.regular-expressions.info/lookaround.html

    Use \b instead of \s
    
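    <?php
    $str = "the property located in the middle of nature the";
    // \b matches at a word boundary, so words at the start or end of the
    // string (or next to punctuation) are matched too. (?<=\b) behaves like a plain \b.
    echo preg_replace('/(?<=\b)(?:a|about|after|all|be|because|in|of|the)\b/i', '', $str);
    ?>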
    
    PHP:
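    One thing the \b version leaves behind is the whitespace around each removed word. A possible sketch that also tidies this up is below. The function name strip_noise_words and its parameters are hypothetical, not from the thread; it swallows the whitespace before each noise word and trims the result.

```php
<?php
// Hypothetical helper: removes whole-word occurrences of the given noise words.
function strip_noise_words(string $text, array $noiseWords): string
{
    // Escape each word so it is safe inside the pattern, then join with "|".
    $alternation = implode('|', array_map(function ($word) {
        return preg_quote($word, '/');
    }, $noiseWords));

    // \s* also removes the whitespace in front of the word; \b on both sides
    // keeps the match to whole words, including at the start or end of the
    // string and before punctuation such as a full stop.
    $stripped = preg_replace('/\s*\b(?:' . $alternation . ')\b/i', '', $text);

    // A noise word at the very start leaves one leading space behind.
    return trim($stripped);
}

$noise = array('a', 'about', 'after', 'all', 'be', 'because', 'in', 'of', 'the');

echo strip_noise_words('the property located in the middle of nature the', $noise), "\n";
// property located middle nature
echo strip_noise_words('the property located in the middle of nature the.', $noise), "\n";
// property located middle nature.
```

    Note that \b treats hyphens and apostrophes as boundaries too, so a word like "all-in" would lose both halves; whether that matters depends on the text being cleaned.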
     
    Kaimi, Apr 14, 2010 IP
  6. MyVodaFone

    #6
    You could also try running array_unique on your string before preg_replace. Doing so will strip out any duplicated words like "the".
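    For reference, array_unique operates on arrays rather than strings, so the string would have to be split into words first. A small sketch of what that might look like, with illustrative variable names; it only drops repeated words and does not remove noise words by itself.

```php
<?php
$text = 'the property located in the middle of nature the';

// array_unique() expects an array, so split the string on whitespace first.
$words = preg_split('/\s+/', trim($text));

// Keeps the first occurrence of each word and drops later repeats.
$unique = array_unique($words);

// implode() joins the remaining values in their original order.
$deduped = implode(' ', $unique);
echo $deduped, "\n"; // the property located in middle of nature
```

    The comparison is case-sensitive, and any legitimately repeated word is dropped as well, so this changes the text in ways the regex approach does not.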
     
    MyVodaFone, Apr 14, 2010 IP
  7. grutland

    #7
    But would that work with a string?
    What Kaimi has suggested is working fine and I've now got the desired effect.

    Cheers for the help.
     
    grutland, Apr 14, 2010 IP
  8. nunewnew

    #8
    "Don't suppose you'd like to explain why that works?" Read "Positive and Negative Lookahead" at http://www.regular-expressions.info/lookaround.html

    "Also, what happens if the noise word is at the end of a sentence?" Use \b, not \s
    
    $str = "Property in the heart of nature";
    echo preg_replace('/(?', '', $str);
    ?>
    Code (markup):
    If I have any good ideas I will post them. Thanks very much.
     
    nunewnew, Apr 29, 2010 IP