Detect encoding and fix, if not utf-8?

Hi guys,

What is the best way to detect a text's encoding and fix it if it contains non-UTF-8 characters?

The ways I've tried:

1) mb_detect_encoding - doesn't work properly
2) iconv("UTF-8","UTF-8//IGNORE",$str) - doesn't work properly
3) preg_replace with different options...
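For comparison with the attempts above, here is a minimal sketch of a stricter approach: validate with mb_check_encoding() (which checks byte sequences rather than guessing an encoding) and convert only when validation fails. The function name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration; the real source encoding of the data is not known from this thread.

```php
<?php
// Hypothetical helper: returns $str unchanged if it is already valid UTF-8,
// otherwise converts the WHOLE string from an assumed fallback encoding.
// The ISO-8859-1 default is an assumption, not something the data confirms.
function to_utf8($str, $fallback = 'ISO-8859-1') {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid, leave it alone
    }
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}
```

Note the limitation: a string that mixes valid UTF-8 sequences with stray single bytes is converted as a whole, so its valid multi-byte characters would be double-encoded. Repairing such mixed strings needs a per-sequence approach like the regex one.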

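The last code I used is the following:
// Callback for preg_replace_callback(): keeps valid UTF-8 sequences and
// re-encodes stray bytes as if they were ISO-8859-1 characters.
function utf8replacer($captures) {
  // isset() plus a '' check instead of !empty(): empty("0") is true in PHP,
  // so !empty() would misclassify the character "0".
  if (isset($captures[1]) && $captures[1] !== '') {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif (isset($captures[2]) && $captures[2] !== '') {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx: subtract 64 to turn 11xxxxxx into 10xxxxxx.
    return "\xC3".chr(ord($captures[3]) - 64);
  }
}
$regex = <<<'END'
/
  ( [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
  | [\xC2-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx (C0, C1 are overlong)
  | [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
  | [\xF0-\xF4][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3 (F5 and up are invalid)
  )
| ( [\x80-\xBF] )               # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )               # invalid byte in range 11000000 - 11111111
/x
END;

// preg_replace_callback() returns the result; it does not modify $txt in place.
$txt = preg_replace_callback($regex, "utf8replacer", $txt);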
For example, this incorrectly saved text is in the db:
digital pointÂ® 
On the final page, it is displayed as:
digital pointï¿½ 
The http://validator.w3.org validator shows the following:
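The two garbled strings above look like what UTF-8 bytes produce when they are decoded as ISO-8859-1. That is one possible reading, not a confirmed diagnosis. A small sketch reproducing the effect, where misread_as_latin1 is a hypothetical helper name:

```php
<?php
// Hypothetical helper: simulates software that treats UTF-8 bytes as
// ISO-8859-1 and re-encodes each byte as its own character.
function misread_as_latin1($utf8Bytes) {
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

// "\xC2\xAE" is the UTF-8 encoding of the registered sign (U+00AE).
$registered = misread_as_latin1("\xC2\xAE");

// "\xEF\xBF\xBD" is the UTF-8 encoding of the replacement character (U+FFFD),
// which decoders emit when they hit a byte they cannot interpret.
$replacement = misread_as_latin1("\xEF\xBF\xBD");
```

Under that assumption, $registered holds the two characters seen in the db example and $replacement holds the three characters seen on the final page.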

Sorry! This document can not be checked.

Sorry, I am unable to validate this document because on line 55 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xAE" does not map to Unicode
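Since the validator only reports a line number, it may help to locate the offending byte directly in the generated output. The following is a rough sketch, assuming the page output is available as a string; first_invalid_utf8_offset is a hypothetical name, and the pattern follows the well-formed byte ranges of UTF-8 (RFC 3629).

```php
<?php
// Hypothetical helper: returns the byte offset of the first byte that is not
// part of a well-formed UTF-8 sequence, or -1 if the whole string is valid.
function first_invalid_utf8_offset($str) {
    // Matches the longest well-formed UTF-8 prefix. The possessive "*+"
    // avoids backtracking into sequences that already matched.
    $valid = '/\A(?:
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*+/x';

    if (preg_match($valid, $str, $m) !== 1) {
        // preg_match() can fail on very long inputs (PCRE limits).
        throw new RuntimeException('PCRE error while scanning the string');
    }

    $len = strlen($m[0]);
    return $len === strlen($str) ? -1 : $len;
}
```

Counting the "\n" characters before the returned offset, for instance with substr_count(), would give the line number to compare against the one the validator reports.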

When I open the page at line 55, it doesn't contain any characters at all. But I suppose the cause is the Â® character, because if I remove it from the db, the validator shows no warnings.

Also, if I save my PHP page as static HTML and run it through the same validator, it is always able to check it, even with the Â® character.

P.S. My php.ini mbstring settings, if needed:
PHP 5.3+ (tried with 5.3.5, 5.3.2)
Multibyte Support enabled
Multibyte string engine libmbfl
HTTP input encoding translation disabled
mbstring.language neutral
mbstring.strict_detection Off
mbstring.substitute_character no value
My PHP page contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
nat000 Member

rais.hussain Peon