I'm using curl to retrieve web pages and then searching their text. The problem arises when comparing text in different charsets/encodings.

A) Search strings are read from a MySQL database with UTF-8 encoding.
B) Web content fetched with curl can be encoded in multiple charsets, such as ISO-8859-1, UTF-8, SJIS, etc.

So the rendered output looks the same as the search strings, but the underlying content is not. This may be due to the charset, or to HTML special characters in the web pages such as " ’ etc.

Example:

Search string from the UTF-8 database:
Apple’s Revolutionary App Store Downloads

Text from a web page with charset=iso-8859-1:
Apple’s Revolutionary App Store Downloads

Is there an easy way to achieve this? The easiest way should be to convert the text to the same charset and then do the comparison... or?
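The "convert to the same charset, then compare" idea can be sketched roughly as follows. This is only an illustration: `to_utf8()` is a hypothetical helper name, it assumes PHP 7+ with the iconv extension, and it assumes the page's charset is already known. The sample word is made up for the demo.

```php
<?php
// Hypothetical helper: convert fetched text to UTF-8 so it can be compared
// with UTF-8 search strings. $sourceCharset is assumed to be known already.
function to_utf8(string $text, string $sourceCharset): string
{
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $text; // nothing to do
    }
    $converted = iconv($sourceCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset");
    }
    return $converted;
}

$needle = "caf\u{00E9}";         // UTF-8 search string, as it would come from the database
$page   = "Le caf\xE9 du coin";  // the same word as ISO-8859-1 bytes

var_dump(strpos($page, $needle) !== false);                         // bool(false)
var_dump(strpos(to_utf8($page, 'ISO-8859-1'), $needle) !== false);  // bool(true)
```

The raw bytes do not match even though both strings render identically, which is the symptom described above. After conversion the plain `strpos()` search works.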
<?php
function un_htmlentities($string)
{
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans[chr(130)] = '&sbquo;';   // Single Low-9 Quotation Mark
    $trans[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
    $trans[chr(132)] = '&bdquo;';   // Double Low-9 Quotation Mark
    $trans[chr(133)] = '&hellip;';  // Horizontal Ellipsis
    $trans[chr(134)] = '&dagger;';  // Dagger
    $trans[chr(135)] = '&Dagger;';  // Double Dagger
    $trans[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
    $trans[chr(137)] = '&permil;';  // Per Mille Sign
    $trans[chr(138)] = '&Scaron;';  // Latin Capital Letter S With Caron
    $trans[chr(139)] = '&lsaquo;';  // Single Left-Pointing Angle Quotation Mark
    $trans[chr(140)] = '&OElig;';   // Latin Capital Ligature OE
    $trans[chr(145)] = '&lsquo;';   // Left Single Quotation Mark
    $trans[chr(146)] = '&rsquo;';   // Right Single Quotation Mark
    $trans[chr(147)] = '&ldquo;';   // Left Double Quotation Mark
    $trans[chr(148)] = '&rdquo;';   // Right Double Quotation Mark
    $trans[chr(149)] = '&bull;';    // Bullet
    $trans[chr(150)] = '&ndash;';   // En Dash
    $trans[chr(151)] = '&mdash;';   // Em Dash
    $trans[chr(152)] = '&tilde;';   // Small Tilde
    $trans[chr(153)] = '&trade;';   // Trade Mark Sign
    $trans[chr(154)] = '&scaron;';  // Latin Small Letter S With Caron
    $trans[chr(155)] = '&rsaquo;';  // Single Right-Pointing Angle Quotation Mark
    $trans[chr(156)] = '&oelig;';   // Latin Small Ligature OE
    $trans[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    ksort($trans);
    // Flip the table so it maps entity => character, then decode.
    $trans = array_flip($trans);
    return strtr($string, $trans);
}

$iso_string = 'Apple&rsquo;s Revolutionary App Store Downloads';
$utf_string = 'Apple’s Revolutionary App Store Downloads';

// compare...
if (un_htmlentities($iso_string) == $utf_string) {
    echo "The strings match...";
}
?>
Will it also work with strings from other charsets, possibly containing characters such as quotes that are not encoded as HTML entities?
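Handling other charsets first requires knowing which one a given page uses. A minimal sketch, assuming the Content-Type value is available (for example from `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`): `charset_from_content_type()` is a hypothetical helper, and the ISO-8859-1 fallback is only an assumed default. Pages may also declare their charset in a meta tag or not at all, which this does not cover.

```php
<?php
// Hypothetical helper: extract the charset label from a Content-Type value,
// e.g. the string returned by curl_getinfo($ch, CURLINFO_CONTENT_TYPE).
function charset_from_content_type(?string $contentType, string $default = 'ISO-8859-1'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default; // assumption: fall back to a caller-chosen default
}

echo charset_from_content_type('text/html; charset=Shift_JIS'), "\n"; // SHIFT_JIS
echo charset_from_content_type('text/html'), "\n";                    // ISO-8859-1
```

The returned label can then be passed to a conversion function such as `iconv()` before comparing, provided the installed iconv recognises that name.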
I tried your code; the converted text becomes:

Apple�s Revolutionary App Store Downloads
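A likely explanation for the � (an assumption, since the raw bytes are not shown in the post): the function above works with single bytes such as chr(146), which is the right single quotation mark in Windows-1252. On its own that byte is not valid UTF-8, so a UTF-8 page or terminal displays a replacement character. The sketch below assumes PHP 7+ with the iconv extension.

```php
<?php
$raw = "Apple" . chr(146) . "s";  // byte 0x92: right single quote in Windows-1252

// preg_match() with the /u modifier fails on invalid UTF-8,
// which makes it usable as a cheap validity check.
var_dump(preg_match('//u', $raw) === 1);  // bool(false): not valid UTF-8

// Converting from Windows-1252 yields the proper UTF-8 character U+2019.
$utf8 = iconv('CP1252', 'UTF-8', $raw);
var_dump($utf8 === "Apple\u{2019}s");     // bool(true)
```

So the output of such a function would need one more conversion step before it can be compared with a UTF-8 string from the database.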
You could encode both strings using htmlentities($buffer, ENT_QUOTES, 'UTF-8'), then use PHP's html_entity_decode() function to decode the diff.
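A rough sketch of that suggestion, assuming PHP 7+ and that both strings are already valid UTF-8. The `$fromWeb` value is a hypothetical example of entity-encoded page text, not taken from a real page.

```php
<?php
$fromDb  = "Apple\u{2019}s Revolutionary App Store Downloads"; // literal right single quote, UTF-8
$fromWeb = "Apple&rsquo;s Revolutionary App Store Downloads";  // hypothetical entity-encoded page text

// Decoding turns &rsquo; into the UTF-8 character, so the strings compare equal.
$decoded = html_entity_decode($fromWeb, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $fromDb);                                    // bool(true)

// Encoding goes the other way: the database string ends up in entity form.
var_dump(htmlentities($fromDb, ENT_QUOTES, 'UTF-8') === $fromWeb); // bool(true)

// Caveat: htmlentities() returns '' when the input is not valid in the given
// charset (unless ENT_SUBSTITUTE or ENT_IGNORE is passed).
var_dump(htmlentities("Apple\x92s", ENT_QUOTES, 'UTF-8'));         // string(0) ""
```

Because of that last caveat, text in another charset should be converted to UTF-8 before it is passed to htmlentities() with 'UTF-8'.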