Comparing text from multiple charsets - how to?

Discussion in 'PHP' started by Blutarsky, May 22, 2010.

  1. #1
    I'm using curl to retrieve web pages and search their text.
    The problem arises when comparing text in different charsets/encodings.

    A) Search strings are read from a MySQL database with UTF-8 encoding.

    B) Web content read by curl can be encoded in multiple charsets such as iso-8859-1, UTF-8, SJIS etc.

    So the rendered output looks the same as the search strings, but the underlying bytes differ, either because of the charset or because the web pages use HTML entities such as &quot; or &rsquo;.

    Example:
    Search string from the UTF-8 database:
    Apple’s Revolutionary App Store Downloads
    Code (markup):
    Text from a web page with charset=iso-8859-1:
    Apple’s Revolutionary App Store Downloads
    Code (markup):

    Is there an easy way to achieve this? I assume the easiest approach is to convert both texts to the same charset and then do the comparison, or is there something better?
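One way to do that, as a rough sketch: convert the fetched text to UTF-8 using the charset the page declares, decode any HTML entities, and only then compare. The helper name `normalize_to_utf8()` and the idea of passing the declared charset in as a parameter are assumptions made for illustration. The sketch also treats ISO-8859-1 as Windows-1252, as browsers do, and expects either the mbstring or the iconv extension to be available.

```php
<?php
// Hypothetical helper (not from the thread): bring fetched text into the
// same form as the UTF-8 search strings before comparing.
function normalize_to_utf8(string $text, string $declaredCharset): string
{
    // Pages labelled ISO-8859-1 often contain Windows-1252 bytes such as
    // 0x92 (right single quote), so treat the label as Windows-1252.
    if (strcasecmp($declaredCharset, 'ISO-8859-1') === 0) {
        $declaredCharset = 'Windows-1252';
    }

    if (strcasecmp($declaredCharset, 'UTF-8') !== 0) {
        // Use mbstring when available, otherwise fall back to iconv.
        $text = function_exists('mb_convert_encoding')
            ? mb_convert_encoding($text, 'UTF-8', $declaredCharset)
            : iconv($declaredCharset, 'UTF-8', $text);
    }

    // Turn entities such as &rsquo; into real UTF-8 characters.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$search   = "Apple\u{2019}s Revolutionary App Store Downloads"; // UTF-8, as from the database
$rawBytes = "Apple\x92s Revolutionary App Store Downloads";     // single-byte page text
$entities = 'Apple&rsquo;s Revolutionary App Store Downloads';  // page text using an entity

var_dump(normalize_to_utf8($rawBytes, 'iso-8859-1') === $search); // bool(true)
var_dump(normalize_to_utf8($entities, 'iso-8859-1') === $search); // bool(true)
```

Once both sides are plain UTF-8, an ordinary `===` or `strpos()` comparison should behave as expected.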
     
    Blutarsky, May 22, 2010
  2. danx10

    #2
    <?php
    
    function un_htmlentities($string)
    {
    
        $trans = get_html_translation_table(HTML_ENTITIES);
        $trans[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
        $trans[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
        $trans[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
        $trans[chr(133)] = '&hellip;';    // Horizontal Ellipsis
        $trans[chr(134)] = '&dagger;';    // Dagger
        $trans[chr(135)] = '&Dagger;';    // Double Dagger
        $trans[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
        $trans[chr(137)] = '&permil;';    // Per Mille Sign
        $trans[chr(138)] = '&Scaron;';    // Latin Capital Letter S With Caron
        $trans[chr(139)] = '&lsaquo;';    // Single Left-Pointing Angle Quotation Mark
        $trans[chr(140)] = '&OElig;';    // Latin Capital Ligature OE
        $trans[chr(145)] = '&lsquo;';    // Left Single Quotation Mark
        $trans[chr(146)] = '&rsquo;';    // Right Single Quotation Mark
        $trans[chr(147)] = '&ldquo;';    // Left Double Quotation Mark
        $trans[chr(148)] = '&rdquo;';    // Right Double Quotation Mark
        $trans[chr(149)] = '&bull;';    // Bullet
        $trans[chr(150)] = '&ndash;';    // En Dash
        $trans[chr(151)] = '&mdash;';    // Em Dash
        $trans[chr(152)] = '&tilde;';    // Small Tilde
        $trans[chr(153)] = '&trade;';    // Trade Mark Sign
        $trans[chr(154)] = '&scaron;';    // Latin Small Letter S With Caron
        $trans[chr(155)] = '&rsaquo;';    // Single Right-Pointing Angle Quotation Mark
        $trans[chr(156)] = '&oelig;';    // Latin Small Ligature OE
        $trans[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
        ksort($trans);
        $trans = array_flip($trans);
        return strtr($string, $trans);
    }
    
    $iso_string = 'Apple&rsquo;s Revolutionary App Store Downloads';
    $utf_string = 'Apple’s Revolutionary App Store Downloads';
    
    //compare...
    if (un_htmlentities($iso_string) == $utf_string) {
        echo "The strings match...";
    }
    
    ?>
    PHP:
     
    danx10, May 22, 2010
  3. Blutarsky

    #3
    Will it also work with strings from other charsets, possibly with characters like quotes that are not encoded as HTML entities?
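Entity decoding on its own does nothing for raw bytes in another charset, so the page's charset has to be known before anything can be converted. A minimal sketch, assuming the charset is announced in the Content-Type header (which curl exposes through `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`); the helper name and the ISO-8859-1 fallback are assumptions, not something from the thread:

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type value such as
// the one curl_getinfo($ch, CURLINFO_CONTENT_TYPE) returns after a request.
function charset_from_content_type(?string $contentType, string $default = 'ISO-8859-1'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default; // assumption: fall back when nothing is declared
}

var_dump(charset_from_content_type('text/html; charset=Shift_JIS')); // "SHIFT_JIS"
var_dump(charset_from_content_type('text/html'));                    // "ISO-8859-1"
```

Pages that declare their charset only in a `<meta>` tag would need a similar check against the HTML itself.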
     
    Blutarsky, May 22, 2010
  4. Blutarsky

    #4
    I tried your code; the converted text becomes:

    Apple�s Revolutionary App Store Downloads
    Code (markup):
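A likely explanation for the replacement character: the decoded quote is a single Windows-1252 byte (0x92), which is not valid UTF-8, whereas the UTF-8 search string stores the same quote as three bytes. The sketch below assumes that is what happened and simply compares the bytes; `is_valid_utf8()` is a hypothetical helper name.

```php
<?php
// Returns true when $s is a well-formed UTF-8 byte sequence.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$decoded  = "Apple\x92s";      // assumed result of a single-byte entity decode
$expected = "Apple\u{2019}s";  // the same text in UTF-8

echo bin2hex($decoded), "\n";   // 4170706c659273
echo bin2hex($expected), "\n";  // 4170706c65e2809973
var_dump(is_valid_utf8($decoded));  // bool(false)
var_dump(is_valid_utf8($expected)); // bool(true)
```

If the bytes differ like this, converting the decoded text from Windows-1252 to UTF-8 before comparing should make the two strings equal.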
     
    Blutarsky, May 22, 2010
  5. lordspace

    #5
    You could encode both strings using htmlentities($buffer, ENT_QUOTES, 'UTF-8') and compare them, then use PHP's html_entity_decode() function to decode the result.
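A small sketch of that idea, assuming both strings are already valid UTF-8. The helper name `to_entity_form()` is made up for illustration; passing `false` as the fourth argument (double_encode) keeps entities that are already present in the page text from being encoded a second time.

```php
<?php
// Hypothetical helper: express a UTF-8 string in entity form. The fourth
// argument (double_encode = false) leaves existing entities such as &rsquo; alone.
function to_entity_form(string $utf8): string
{
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8', false);
}

$fromDb   = "Apple\u{2019}s Revolutionary App Store Downloads";
$fromPage = 'Apple&rsquo;s Revolutionary App Store Downloads';

var_dump(to_entity_form($fromDb) === to_entity_form($fromPage)); // bool(true)
```

Note that a page using a numeric entity such as `&#8217;` would not match this way; decoding both sides with html_entity_decode() instead of encoding them is probably the more robust direction.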
     
    lordspace, May 22, 2010
  6. Blutarsky

    #6
    Wouldn't re-encoding the already UTF-8 encoded string cause a problem?
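As far as htmlentities() is concerned, the 'UTF-8' argument only states which encoding the input is in; it does not convert the string. A string that really is UTF-8 should therefore round-trip unchanged, and the case to watch is input that is not valid UTF-8. A minimal check, with example strings chosen for illustration:

```php
<?php
$utf8 = "Apple\u{2019}s Revolutionary App Store Downloads";

// The 'UTF-8' argument describes the input; it does not convert anything.
$encoded = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
var_dump(html_entity_decode($encoded, ENT_QUOTES, 'UTF-8') === $utf8); // bool(true)

// Input that is not valid UTF-8 is the case that fails: without ENT_SUBSTITUTE
// or ENT_IGNORE, htmlentities() returns an empty string.
$invalid = htmlentities("Apple\x92s", ENT_QUOTES, 'UTF-8');
var_dump($invalid); // string(0) ""
```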
     
    Blutarsky, May 22, 2010