Is preg_split() the best choice for this?

Discussion in 'PHP' started by AntelopeSalad, Feb 28, 2010.

  1. #1
    Working on search engine functionality and I'm trying to handle queries that contain matched quotes.

    Example:
    input: foo , output: " AND x LIKE '%foo%'"
    input: foo bar , output: " AND x LIKE '%foo%' AND x LIKE '%bar%'"
    input: "foo bar" , output: " AND x LIKE '%foo bar%'"

    At the moment I'm splitting the search string by a space then looping through each word and just building a straight up "AND x LIKE..." string which gets appended to the final query I run. It takes a lot of logic just using explode() to handle the quotes correctly, there has to be a better way.

    Perhaps using preg_split() would be a better alternative? I don't use regexs too often so I'm not really sure if this is even possible, thoughts?

    The end result would be getting an array of terms. In the above example the array should be:

    0 = foo
    1 = foo
    2 = bar
    3 = foo bar

    Or if this is more clear:

    User searches for: foo bar

    Array would be:

    0 = foo
    1 = bar

    User searches for: "foo bar"

    Array would be:
    0 = foo bar

    User searches for: foo "foo bar"

    Array would be:
    0 = foo
    1 = foo bar
     
    AntelopeSalad, Feb 28, 2010 IP
  2. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Anyone? Surely someone has created such a regular expression by now. :D
     
    AntelopeSalad, Mar 1, 2010 IP
  3. Kaizoku

    Kaizoku Well-Known Member

    Messages:
    1,261
    Likes Received:
    20
    Best Answers:
    1
    Trophy Points:
    105
    #3
    Try explode(' ', $input);
     
    Kaizoku, Mar 1, 2010 IP
  4. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Not to be a jerk or anything but if you read my post you'd see that is what I'm doing heh. exploding doesn't really work too well by itself when "foo bar" (in quotes) should be considered 1 term. You can make it work like this but the amount of code to determine if the quotes match and re-splicing words that contain 1 quote back together can't be the best way to do it. It feels so rigged.
     
    AntelopeSalad, Mar 1, 2010 IP
  5. Narrator

    Narrator Active Member

    Messages:
    392
    Likes Received:
    9
    Best Answers:
    0
    Trophy Points:
    80
    #5
    Try using this to check for quotes
    $numMatches=preg_match_all('/"(.*?)"/',$var,$array);
    Code (markup):
    I'm not the best with regex's but it should work
     
    Narrator, Mar 1, 2010 IP
  6. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Hmm that just counts matches. Same effect could be done with $x = $value & 1 with way less overhead unless I'm missing something obvious? & 1 checks if the number is odd or even. 0 = even , 1 = odd. If it's even then we know we have a match. The amount of matches doesn't help us match the words after we explode() it.

    I thought maybe a regex could have existed where it could do something like:
    If the string is surrounded by quotes, replace spaces with double underscores for what's inside of the quotes then return an array of words that are split by spaces.

    Now for a query such as: foo "foo bar"

    The preg_split() would return:
    0 = foo
    1 = foo__bar

    From here we can replace the double underscore with a space and treat it as 1 term in our db query.
     
    AntelopeSalad, Mar 2, 2010 IP
  7. Narrator

    Narrator Active Member

    Messages:
    392
    Likes Received:
    9
    Best Answers:
    0
    Trophy Points:
    80
    #7
    I'm sorry I wasn't really clear with that preg_match. This is how I would do it but it probably isn't the most efficient.

    
    //of course some sanitizing stuff up here
    $count=0;
    $sql="SELECT * FROM table WHERE ";
    //quoted searches
    $numMatches=preg_match_all('/"(.*?)"/',$var,$array);
    	if($numMatches!=0){
    		foreach($array[1] as $value){
    			$count++;
    			if($count>1){$sql.="AND ";}
    			$sql.="x LIKE '%$value%' ";
    			$var=str_replace("\"$value\"","",$var); //remove quoted searches (we're done with them)
    		}//end foreach
    	}//end if
    $var=preg_replace('!\s+!', ' ', $var); //remove multiple spaces
    $var=trim($var); //remove leading & ending spaces
    //non quoted searches
    $array=explode(" ",$var);
    	foreach($array as $value){
    		if($value!=""){ //make sure its not blank
    			$count++;
    			if($count>1){$sql.="AND ";}
    			$sql.="x LIKE '%$value%' ";
    		}//end if
    	}//end foreach
    
    PHP:
    Hope this helps
    Cheers!
     
    Narrator, Mar 2, 2010 IP
  8. JEET

    JEET Notable Member

    Messages:
    3,832
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #8
    I'm not very good with regex, and may be there's a better way of doing this. However this code might be of use:


    
    $s= 'some foo1 "foo bar" foo2';
    
    function to_array($s){
    $f= preg_replace( '/[^A-Za-z0-9]/', "-", $s);  $f= preg_replace( '/--(.*?)--/', '--"$1"--', $f); 
    $f= explode('--',$f);
    $v=array();
    foreach($f as $a){
    $a=str_replace('-',' ',$a);
    if(substr($a,0,1)!='"'){ $a=explode(' ',$a); foreach($a as $k){ $v[]=$k; } }else{ $a=str_replace('"','',$a); $v[]=$a; }
    }
    return $v;
    }
    
    echo '<pre>'; print_r(to_array($s));
    
    Code (markup):
    Thanks :)
     
    JEET, Mar 2, 2010 IP
  9. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #9
    Narrator:
    On a dry run reading directly from the querystring it executes in 0.00013589859008789. This includes everything from start to finish (the SQL query is generated in the end). Only problem is it doesn't strip out 1 quote if a search string happens to be incorrectly formatted.

    Example: foo "foo bar" "hello
    [Not a big deal really, could clean that up no problem]

    JEET:
    Yours is failing for some reason.

    Example: foo "foo bar" returns foo, foo, bar but if I use foo "foo bar" foo then it magically works as intended. I didn't time it because yours is returning the array and not building the SQL Query (not an issue).

    Thanks guys, from here I think I can get a solid implementation of this feature.
     
    AntelopeSalad, Mar 3, 2010 IP
  10. koko5

    koko5 Active Member

    Messages:
    394
    Likes Received:
    14
    Best Answers:
    1
    Trophy Points:
    70
    #10
    Here is my solution:

    
    $input='foo';
    $output = (trim($input,'"')!==$input) ? ' AND x LIKE \'%'.trim($input,'"').'%\'' : preg_replace('#\b(\w+)\b#mui',' AND x LIKE \'%$1%\'',$input);
    echo "{$output}\n";
    
    $input='"foo"';
    $output = (trim($input,'"')!==$input) ? ' AND x LIKE \'%'.trim($input,'"').'%\'' : preg_replace('#\b(\w+)\b#mui',' AND x LIKE \'%$1%\'',$input);
    echo "{$output}\n";
    
    $input='foo bar';
    $output = (trim($input,'"')!==$input) ? ' AND x LIKE \'%'.trim($input,'"').'%\'' : preg_replace('#\b(\w+)\b#mui',' AND x LIKE \'%$1%\'',$input);
    echo "{$output}\n";
    
    $input='"foo bar"';
    $output = (trim($input,'"')!==$input) ? ' AND x LIKE \'%'.trim($input,'"').'%\'' : preg_replace('#\b(\w+)\b#mui',' AND x LIKE \'%$1%\'',$input);
    echo "{$output}\n";
    
    PHP:
    Regards :)
     
    Last edited: Mar 3, 2010
    koko5, Mar 3, 2010 IP
  11. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #11
    koko5,

    Pretty clever usage of preg_replace. I can't even measure it with microtime because it's less than what it can report.

    Fails on:
    Input -> foo "foo bar"
    Output -> AND x LIKE '%foo "foo bar%'

    Other than that this looks like the most elegant approach. Would you mind adjusting it to handle quotes anywhere in the string, as I'm not sure what's going on with that regex (need to research it).
     
    AntelopeSalad, Mar 3, 2010 IP
  12. koko5

    koko5 Active Member

    Messages:
    394
    Likes Received:
    14
    Best Answers:
    1
    Trophy Points:
    70
    #12
    I see.OK, I can suggest this regexp:
    
    $input = 'foo "foo bar"';
    $output = (($input[0]=='"')&preg_match('#"$#mu',trim($input))) ? ' AND x LIKE \'%'.str_replace('"','',trim($input)).'%\'' : preg_replace('#\s?(\S+)\s?#mui',' AND x LIKE \'%$1%\'',str_replace('"','',trim($input)));
    echo "{$output}\n";
    
    PHP:
    ^^This is not problem for sql /double "like foo"/, but you can rebuild it, to handle it.

    
    $input = '"foo "foo bar"';
    
    PHP:
    All you have to do:
    - check if trim($input) is empty string;
    - pass only $input in mysql_real_escape_string.
    Regards :)
     
    koko5, Mar 3, 2010 IP
  13. ChrisMac

    ChrisMac Peon

    Messages:
    28
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #13
    Hi, just an idea...

    MySQL alone (with "SELECT ... WHERE LIKE '%...%'") isn't the best option for building search engine like system. Take a look at:

    http://www.sphinxsearch.com/
     
    ChrisMac, Mar 3, 2010 IP
  14. JEET

    JEET Notable Member

    Messages:
    3,832
    Likes Received:
    502
    Best Answers:
    19
    Trophy Points:
    265
    #14
    This is the fix of my function posted earlier: Returns a array which you can use to build the query.

    
    function to_array($s){
    $s.=' ';
    $f= preg_replace( '/[^A-Za-z0-9]/', "-", $s);  $f= preg_replace( '/--(.*?)--/', '--"$1"--', $f); 
    $f= explode('--',$f);
    $v=array();
    foreach($f as $a){
    $a=str_replace('-',' ',$a);
    if(substr($a,0,1)!='"'){ $a=explode(' ',$a); foreach($a as $k){ $v[]=$k; } }else{ $a=str_replace('"','',$a); $v[]=$a; }
    }
    return $v;
    }
    
    Code (markup):
    Thanks :)
     
    JEET, Mar 3, 2010 IP
  15. sunlcik

    sunlcik Peon

    Messages:
    39
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #15
    Kindly to see this code:

    I've writen a function for you.Actually,you could change the code by yourself to do other things what you want to do.

    <?php
    $input  = '"foo bar"asd,f';
    $xyz    = my_split($input);
    print_r($xyz);
    function my_split($input)
    {
            if(preg_match_all('|"(.*?)"|si',$input,$arr))
            {
                    //true
                    $str            = preg_replace('|"(.*?)"|si','#',$input);
                    $keywords       = preg_split('|[^\w]+|si',$str);
                    //return array
                    $return         = array_merge($arr[1],$keywords);
            }else{
                    $keywords       = preg_split('|[^\w]+|si',$input);
                    $return         = $keywords;
            }
            //clear empty value.You could do others at here
            return $return;
    }
    
    PHP:
    Sincerelay!
    Wu
     
    sunlcik, Mar 3, 2010 IP
  16. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #16
    This just separates them into 3 search terms. Ideally we would want:

    Input -> foo "foo bar"
    Output -> AND x LIKE '%foo%' AND x LIKE '%foo bar%'

    Your first solution almost worked but it fumbled with the quotes. Original output (from your first post) was:
    Output -> AND x LIKE '%foo "foo bar%'

    All in all, full functionality would be:

    Input -> foo
    Output -> AND x LIKE '%foo%'

    Input -> foo bar
    Output -> AND x LIKE '%foo%' AND x LIKE '%bar%'

    Input -> foo "foo bar"
    Output -> AND x LIKE '%foo%' AND x LIKE '%foo bar%'

    Input -> "foo bar" foo
    Output -> AND x LIKE '%foo bar%' AND x LIKE '%foo%'

    Input -> "foo bar"
    Output -> AND x LIKE '%foo bar%'
     
    AntelopeSalad, Mar 4, 2010 IP
  17. AntelopeSalad

    AntelopeSalad Peon

    Messages:
    85
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #17
    I was thinking about this too but I don't think this project is big enough to really warrant it? When I mentioned search engine in the original post I didn't mean something as huge as Google. The page I plan to write the search functionality for will have:

    1x table with 7,000 rows (3 searchable columns, all about 30 chars in length) accessed by 1-30,000 concurrent users.
    1x table with 16,000 rows (4 searchable columns, in between 2-30 chars in length) accessed by 1-30,000 concurrent users.
    1x table with an unknown amount of rows (2 searchable columns, in between 4-60 chars in length) but this table would only be accessed by a few people at a time (admin features).
     
    AntelopeSalad, Mar 4, 2010 IP
  18. koko5

    koko5 Active Member

    Messages:
    394
    Likes Received:
    14
    Best Answers:
    1
    Trophy Points:
    70
    #18
    This works with all example $input-s:

    
    $arr = (false!==mb_strpos($input,'"',0,'UTF-8')) ? preg_split('#"([^"]+)"#miu',$input,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE) : preg_split('#([^\s"]+)#miu',$input,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
    print_r($arr);
    
    PHP:
    If it's ok , array_walk and array_filter on empty and spaces only elements will do the rest :)
     
    Last edited: Mar 4, 2010
    koko5, Mar 4, 2010 IP
  19. sunlcik

    sunlcik Peon

    Messages:
    39
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #19
    Congratulations
     
    sunlcik, Mar 4, 2010 IP