Detect encoding and fix, if not utf-8?

Discussion in 'PHP' started by nat000, Jul 2, 2011.

  1. #1
    Hi guys,

    What is the best way to detect a text's encoding and fix it if it contains non-UTF-8 characters?


    The ways I've tried:


    1) mb_detect_encoding - doesn't work properly
    2) iconv("UTF-8","UTF-8//IGNORE",$str) - doesn't work properly
    3) preg_replace with different options...
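
For reference, a minimal sketch of a different approach from the three listed above: check validity with mb_check_encoding and convert only when that check fails. The helper name ensure_utf8 and the ISO-8859-1 default are assumptions for illustration; the real source encoding of the stored data is not known from this post, and the sketch needs the mbstring extension.

```php
<?php
// Hypothetical helper: return $str unchanged when it is already valid UTF-8,
// otherwise reinterpret its bytes as $assumedSource and convert them.
// The ISO-8859-1 default is an assumption, not something detected from the data.
function ensure_utf8($str, $assumedSource = 'ISO-8859-1')
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $assumedSource);
}

// "\xAE" is the registered sign in ISO-8859-1; on its own it is not valid UTF-8.
$fixed = ensure_utf8("digital point\xAE");
```

Already valid UTF-8 passes through untouched, so the helper can be applied to mixed data without double-encoding it.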

    The last code I tried is the following:

    
    function utf8replacer($captures) {
      // Compare against '' instead of using empty(): empty("0") is true,
      // so a literal "0" character would otherwise fall through to the last branch.
      if (isset($captures[1]) && $captures[1] !== '') {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
      }
      elseif (isset($captures[2]) && $captures[2] !== '') {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
      }
      else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx (subtract 64 to turn 11xxxxxx into 10xxxxxx).
        return "\xC3".chr(ord($captures[3]) - 64);
      }
    }
    $regex = <<<'END'
    /
      ( [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
      | [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
      | [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
      | [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
      )
    | ( [\x80-\xBF] )               # invalid byte in range 10000000 - 10111111
    | ( [\xC0-\xFF] )               # invalid byte in range 11000000 - 11111111
    /x
    END;
    
    // preg_replace_callback returns the new string; it does not modify $txt in place.
    $txt = preg_replace_callback($regex, "utf8replacer", $txt);
    Code (markup):
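
The two invalid-byte branches in the callback follow the general rule for writing a code point between U+0080 and U+00FF as two UTF-8 bytes, on the assumption that the stray byte is ISO-8859-1. A small sketch of that arithmetic (latin1_byte_to_utf8 is a hypothetical name used only here):

```php
<?php
// Hypothetical helper: encode one ISO-8859-1 byte as UTF-8.
// Bytes below 0x80 are ASCII and stay as they are. Higher bytes become
// 110000yy 10xxxxxx, where yy is the top two bits and xxxxxx the low six bits.
function latin1_byte_to_utf8($byte)
{
    $code = ord($byte);
    if ($code < 0x80) {
        return $byte;
    }
    return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
}

$registered = latin1_byte_to_utf8("\xAE"); // lead byte 0xC2, second byte unchanged
$eAcute     = latin1_byte_to_utf8("\xE9"); // lead byte 0xC3, second byte is 0xE9 - 0x40
```

So a byte of the form 10xxxxxx keeps its value after a 0xC2 lead byte, while a byte of the form 11xxxxxx needs 0x40 subtracted before it can follow 0xC3.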

    For example, this text is stored incorrectly in the db:
    digital point® 
    Code (markup):
    On the final page it is displayed as:
    digital point� 
    Code (markup):
    The validator at http://validator.w3.org shows the following:

    Sorry! This document can not be checked.

    Sorry, I am unable to validate this document because on line 55 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

    The error was: utf8 "\xAE" does not map to Unicode


    When I open the page, line 55 doesn't seem to contain any characters at all. But I suppose the cause is the
    ® 
    Code (markup):
    character, because if I remove it from the db, the validator shows no warnings.
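
To narrow down which lines of the generated page carry the bytes the validator rejects, something like the following sketch could be run against the page output. The function name invalid_utf8_lines and the sample $page string are made up for illustration; the check relies on preg_match with the /u modifier failing on subjects that are not valid UTF-8.

```php
<?php
// Hypothetical helper: return the 1-based numbers of the lines in $text
// that are not valid UTF-8.
function invalid_utf8_lines($text)
{
    $bad = array();
    foreach (explode("\n", $text) as $i => $line) {
        // With /u, PCRE validates the subject first. The empty pattern matches
        // any valid string (result 1); an invalid subject makes preg_match fail.
        if (preg_match('//u', $line) !== 1) {
            $bad[] = $i + 1;
        }
    }
    return $bad;
}

// Sample input: only the second line contains a stray \xAE byte.
$page = "line one\ndigital point\xAE\nline three";
$bad  = invalid_utf8_lines($page);
```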

    Also, if I save my PHP page as static HTML and run it through the same validator, it is always able to check it, even with
    ® 
    Code (markup):
    P.S. My php.ini mbstring settings, if needed:

    
    php 5.3+ (tried with 5.3.5, 5.3.2)
    Multibyte Support enabled
    Multibyte string engine libmbfl
    HTTP input encoding translation disabled
    mbstring.language neutral
    mbstring.strict_detection Off
    mbstring.substitute_character no value
    
    Code (markup):
    My PHP page contains the following header:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    Code (markup):
     
    nat000, Jul 2, 2011
  2. rais.hussain

    #2
    Nice post, it worked for me. Thanks.
     
    rais.hussain, Jul 2, 2011