Don't use htmlentities() if we are dealing with UTF-8 Unicode?

Discussion in 'PHP' started by winterheat, Sep 4, 2008.

  1. #1
    It seems that htmlspecialchars() in PHP deals with the usual special characters:

    < > & " etc

    and htmlentities() also deals with the other foreign characters that have named entities, like an "e" with an accent mark on top of it... like the é in Généré

    So if we are dealing with a UTF-8 encoded string, using htmlentities() can be quite dangerous: unless it is told the string is UTF-8, it may treat each byte of a multi-byte UTF-8 character as a separate foreign character and convert the bytes one by one, instead of producing &eacute; for that special é character.
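
    To make the byte problem concrete, here is a minimal sketch. It assumes the input really is UTF-8 and passes the charset explicitly both ways, so the result does not depend on the PHP version's default. The variable names are only for illustration.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "G\xC3\xA9n\xC3\xA9r\xC3\xA9"; // "Généré"

// Wrong charset: each byte is read as a separate ISO-8859-1 character
// (0xC3 is "Ã", 0xA9 is "©").
$wrong = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

// Right charset: the two-byte sequence is read as one character.
$right = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // G&Atilde;&copy;n&Atilde;&copy;r&Atilde;&copy;
echo $right, "\n"; // G&eacute;n&eacute;r&eacute;
```

    The first result shows up in a browser as "GÃ©nÃ©rÃ©", which is the typical symptom of this mistake.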

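    Actually... I wondered whether htmlspecialchars() and htmlentities() could both be dangerous on a UTF-8 string for another reason: either function might see a "<" byte and convert it into &lt; when that byte is really the second or third byte of a UTF-8 character (each UTF-8 character is 1 to 4 bytes long). That cannot happen, though: every byte of a multi-byte UTF-8 character is 0x80 or higher, so an ASCII byte like "<" (0x3C) is always a character on its own. The real risk is the one above, where the bytes are read in the wrong single-byte encoding.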
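
    The byte-level worry above can be checked directly. The sketch below is not from any library: hasAsciiByteInsideMultibyte() is a hypothetical helper, and it assumes PHP 7 or later and valid UTF-8 input. In UTF-8, lead bytes of multi-byte characters are 0xC2 to 0xF4 and continuation bytes are 0x80 to 0xBF, so the check should never find an ASCII byte such as "<" (0x3C) inside a multi-byte character.

```php
<?php
// Hypothetical helper: reports whether any byte of a multi-byte UTF-8
// character falls in the ASCII range (below 0x80).
function hasAsciiByteInsideMultibyte(string $utf8): bool
{
    // The "u" flag makes PCRE treat the input as UTF-8, so this splits
    // the string into whole characters rather than bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            continue; // single byte: a plain ASCII character
        }
        foreach (str_split($char) as $byte) {
            if (ord($byte) < 0x80) {
                return true;
            }
        }
    }
    return false;
}

// Mix of 1-, 2-, 3- and 4-byte characters around real "<" and ">".
$sample = "G\u{E9}n\u{E9}r\u{E9} <b>\u{65E5}\u{672C}</b> \u{1F600}";

var_dump(hasAsciiByteInsideMultibyte($sample)); // bool(false)

// Only the genuine "<" and ">" characters are escaped.
$escaped = htmlspecialchars($sample, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
```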

    But in either case, if we specify the third argument as "UTF-8", both functions will handle Unicode properly.
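
    As a rough sketch of that advice, the charset can be fixed in one place with a small wrapper. escapeHtml() is a made-up name, not a PHP built-in, and the example assumes PHP 7 or later and a page that is itself served as UTF-8.

```php
<?php
// Hypothetical helper name; not part of PHP itself.
function escapeHtml(string $text): string
{
    // The third argument names the input encoding explicitly, so the
    // result does not depend on the PHP version's default charset.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$out = escapeHtml("<a href=\"x\">G\u{E9}n\u{E9}r\u{E9}</a>");
echo $out, "\n"; // &lt;a href=&quot;x&quot;&gt;Généré&lt;/a&gt;
```

    On a UTF-8 page the é can stay as it is, so htmlspecialchars() is usually enough and htmlentities() is only needed if the output must be plain ASCII.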
     
    winterheat, Sep 4, 2008