UTF-8 XML Parsing Problem

Discussion in 'PHP' started by Ian, Nov 29, 2008.

  1. #1
    I'm having an issue with the code below that parses and imports my XML feed into my database. Basically the issue is if one of the fields contains a special character, the rest of the text is deleted and only the symbol gets inserted.

    Example:

    <value1>TRADEMARKED TERMâ„¢</value1>

    inserts just the â„¢ upon being imported, and the text before it is being deleted. I stopped the query with the die() function to see what it was trying to import, and the text is getting lost while being parsed, so it's obviously a parsing issue and not a database issue.

    Needless to say I've been at this for a couple of days and am still at a loss. So if there's anyone out there that may know how to solve this one I'd really appreciate it.

    Thanks again - Ian

    
    <?php
    // initialize some variables
    $currentTag = "";
    
    // this array will hold the values for the SQL statement
    $values = array();
    
    // this array will hold allowed fields/elements
    $allowedFields = array("sku", "value1", "value2");
    
    // XML file to parse
    $xml_file="file.xml";
    
    // database parameters
    $host = "localhost";
    $user = "name";
    $pass = "pass";
    $db = "db";
    $table = "table";
    
    // called when parser finds start tag
    function startElementHandler($parser, $name, $attributes)
    {
       global $currentTag;
       $currentTag = $name;
    }
    
    // called when parser finds end tag
    function endElementHandler($parser, $name)
    {
       global $values, $currentTag;
       // import database link and table name
       global $connection, $table;
    
       // if ending <item> tag
       // implies end of record
       if (strtolower($name) == "xmlfield")
       {
          // NOTE: I dont know what the indices of $values are,
          // so you might need to do a debug print_r($values)
          // to find out what they are.
          $sku = $values['sku'];
          $value1 = $values['value1'] ? $values['value1'] : '';
          $value2 = $values['value2'] ? $values['value2'] : '';
       
          // generate the query string
          $query = "INSERT INTO table ";
          $query .= "(sku,value1,value2) ";
          $query .= "VALUES('$sku', '$value1', '$value2');";
    
          // uncomment for debug
          // print $query;
    
          // execute query
          mysql_query("SET CHARACTER SET utf8");
          mysql_query("SET NAMES utf8");
          $result = mysql_query($query);
    
          // reset all internal counters and arrays
          $values = array();
          $currentTag = "";
       }
    
    }
    
    // called when parser finds cdata
    function characterDataHandler($parser, $data)
    {
    
       global $currentTag, $values, $allowedFields;
       // lowercase tag name
       $currentTag = strtolower($currentTag);
    
       // look for tag in $allowedFields[] array
       // to see if it is to be included in query
       if (in_array($currentTag, $allowedFields) && trim($data) != "")
       {
          // add field=>value pairs to $values array
          $values[$currentTag] = mysql_escape_string($data);
       }
    }
    
    // initialize parser
    $xml_parser = xml_parser_create();
    
    // turn off whitespace processing
    xml_parser_set_option($xml_parser,XML_OPTION_SKIP_WHITE, TRUE);
    // turn on case folding
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, FALSE);
    
    // set callback functions
    xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
    xml_set_character_data_handler($xml_parser, "characterDataHandler");
    
    // open connection to database
    $connection = mysql_connect($host, $user, $pass) or die ("Unable to connect!");
    mysql_select_db($db) or die ("Unable to select database!");
    
    // read XML file
    if (!($fp = fopen($xml_file, "r")))
    {
       die("File I/O error: $xml_file");
    }
    
    // parse XML
    while ($data = fread($fp, 4096))
    {
       // error handler
       if (!xml_parse($xml_parser, $data, feof($fp)))
       {
          $error_code = xml_get_error_code($xml_parser);
          die("XML parser error (error code " . $error_code . "): " . xml_error_string($error_code) . "<br>Error occurred at line " . xml_get_current_line_number($xml_parser));
       }
    }
    
    // all done, clean up!
    xml_parser_free($xml_parser);
    mysql_close($connection);
    
    ?>
    
    
    PHP:
     
    Ian, Nov 29, 2008 IP
  2. artiskool

    artiskool Peon

    Messages:
    34
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #2
    What's PHP version? If PHP5, try using simplexml rather than SAX.

     
    artiskool, Nov 29, 2008 IP
  3. Ian

    Ian Well-Known Member

    Messages:
    409
    Likes Received:
    14
    Best Answers:
    0
    Trophy Points:
    125
    #3
    Still no luck so far. I'll send $10 via PayPal to the person who helps me with this one...I'm baffled :confused:
     
    Ian, Nov 30, 2008 IP
  4. joebert

    joebert Well-Known Member

    Messages:
    2,150
    Likes Received:
    88
    Best Answers:
    0
    Trophy Points:
    145
    #4
    The parser is parsing the character data within the element in two pieces, and this line is causing it to overwrite the previous piece when it gets to the special character piece.

    $values[$currentTag] = mysql_escape_string($data);
    Code (markup):
    You may need to do some fenagling with character encodings within your characterDataHandler, but changing that "=" operator into a ".=" operator will solve your initial issue of missing data.
     
    joebert, Dec 1, 2008 IP
    Ian likes this.
  5. elias_sorensen

    elias_sorensen Well-Known Member

    Messages:
    852
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    110
    #5
    Use htmlspecialchars() :)

    And then put this in the top most of your php file:

    header('Content-type: application/xml; charset="iso-8859-1"');
     
    elias_sorensen, Dec 1, 2008 IP
  6. Ian

    Ian Well-Known Member

    Messages:
    409
    Likes Received:
    14
    Best Answers:
    0
    Trophy Points:
    125
    #6
    That solved it. Thank you for your help and please PM me with your PayPal Info so I can take care of you. Rep will be given as well :cool:
     
    Ian, Dec 3, 2008 IP