I'm exporting some data (using PHP) to a CSV file for import into another system, and I'm running into trouble because some foreign characters in my database are disallowed by the service I need to import into. What I want to do, ideally, is convert all non-US-English characters, such as Æ, to the nearest equivalent (for this example, AE). If that's not possible, I need to simply remove all non-English characters altogether. Any help would be greatly appreciated.
Have a look at this page and its user comments; they should get you started: http://www.php.net/strtr
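For anyone following that link, here is a minimal sketch of the array form of strtr(). The function name replace_mapped_chars and the three-entry table are made up for illustration, and it assumes both the script file and the data are UTF-8.

```php
<?php
// Hypothetical replacement table; extend it with whatever characters your data contains.
$map = array(
    'À' => 'A',
    'Æ' => 'AE',
    'Ç' => 'C',
);

// With an array argument, strtr() swaps each key for its value in a single pass
// and does not re-scan text it has already replaced.
function replace_mapped_chars($data, array $map)
{
    return strtr($data, $map);
}

echo replace_mapped_chars('Æon Çelik', $map), "\n"; // AEon Celik
```

Characters that are not in the table pass through unchanged, so a second step is still needed to strip whatever is left.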
Try this:
<?
$data = "Original Data..."; // Your original data
$invalid = array("&#192;", "&#198;", "&#199;"); // À, Æ and Ç - add as many as you need using HTML character codes
$valid = array("A", "AE", "C"); // "Equivalents"
$converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
$data = ereg_replace("[^A-Za-z0-9]", "", $converted); // Removes the remaining non-alphanumeric characters
?>
Thanks for the replies, guys; rep left for both. My problem is that I'm using a huge database with over 100,000 rows and a ton of data in each row, so there could be a ton of these foreign characters. It's not just one or two I need to eliminate, but all non-English chars. Do I have to go through and do this replace stuff for every single one? Surely there's a function out there that already does this?
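On the "surely there's a function" question: one candidate is iconv() with the //TRANSLIT flag, which asks the converter for the nearest ASCII equivalent. The sketch below assumes the data is UTF-8, and to_ascii_sketch is a made-up name. What //TRANSLIT produces depends on the platform's iconv implementation, so treat the result as something to test on your own server, not as a guarantee.

```php
<?php
// Hypothetical helper. Assumes $data is UTF-8 text.
function to_ascii_sketch($data)
{
    $ascii = false;
    if (function_exists('iconv')) {
        // //TRANSLIT requests the closest ASCII spelling; //IGNORE drops what cannot be mapped.
        // Some iconv builds do not support these flags and return false instead.
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $data);
    }
    if ($ascii === false) {
        // Fallback: remove every non-ASCII byte.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $data);
    }
    // Same final filter as the snippet above: keep letters and digits only.
    return preg_replace('/[^A-Za-z0-9]/', '', $ascii);
}

echo to_ascii_sketch('Æble Çar'), "\n";
```

The final filter matters because some implementations emit placeholder characters such as ? for letters they cannot transliterate.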
Well, the code I gave you should replace only the characters you input. The rest will simply be removed. (That is, all characters that aren't A-Z, a-z or 0-9.)
Very important: Make a backup of your database before applying any code on the data. I guess you knew that already, but I can't repeat it often enough.
Ah okay! I didn't look carefully enough. I just saw str_replace and the few conversions and thought I knew what you had in mind. I'll give this a whirl.
Thanks nico, actually not necessary though because I'm just going to implement this before the export to csv, not on the incoming data.
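A rough sketch of cleaning at export time, so the database itself is never touched. clean_field and rows_to_csv are made-up names, the three-character table is only a placeholder, and the rows are assumed to be arrays of UTF-8 strings.

```php
<?php
// Hypothetical stand-in for the cleaning step discussed in this thread.
function clean_field($value)
{
    $value = strtr($value, array('À' => 'A', 'Æ' => 'AE', 'Ç' => 'C'));
    return preg_replace('/[^A-Za-z0-9]/', '', $value);
}

// Builds the CSV text in a temporary stream, cleaning every field on the way out.
function rows_to_csv(array $rows)
{
    $out = fopen('php://temp', 'r+');
    foreach ($rows as $row) {
        fputcsv($out, array_map('clean_field', $row), ',', '"', '\\');
    }
    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);
    return $csv;
}

echo rows_to_csv(array(array('Æsop', 'Ça va'))); // AEsop,Cava
```

Swapping php://temp for a real file path writes the export to disk instead.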
One last thing--I actually need to keep commas, periods & apostrophes. I suck at regex so would you mind helpin' me with that?
Replace
$data = ereg_replace("[^A-Za-z0-9]", "", $converted); // Removes the remaining non-alphanumeric characters
with
$data = ereg_replace("[^A-Za-z0-9,\.']", "", $converted); // Removes the remaining non-alphanumeric characters
That makes the final code:
<?
$data = "Original Data..."; // Your original data
$invalid = array("&#192;", "&#198;", "&#199;"); // À, Æ and Ç - add as many as you need using HTML character codes
$valid = array("A", "AE", "C"); // "Equivalents"
$converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
$data = ereg_replace("[^A-Za-z0-9,\.']", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
?>
Worked like a charm, thanks so much for the help. Not that this is much of a contribution on my part, but I put it together as a function thinking that if we post it here, people can add conversions to it over time and it could gradually improve. So if anyone does add more conversions to the function, please reply with your code so we can all benefit.
<?php
# Please contribute by adding nonenglish characters to the $invalid array and corresponding english chars to the $valid array
function english_only($data) {
    $data = $param; // Your original data
    $invalid = array("À", "Æ", "Ç"); // À, Æ and Ç - add as many as you need using HTML character codes
    $valid = array("A", "AE", "C"); // "Equivalents"
    $converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
    $data = ereg_replace("[^A-Za-z0-9\.']", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
}
english_only($stringtoclean);
?>
Thanks for correcting that. I forgot to reverse some of the changes I'd made for my script; the function I posted here was actually missing the comma in the ereg_replace as well.
<?php
# Please contribute by adding nonenglish characters to the $invalid array and corresponding english chars to the $valid array
function english_only($data) {
    $invalid = array("À", "Æ", "Ç"); // À, Æ and Ç - add as many as you need using HTML character codes
    $valid = array("A", "AE", "C"); // "Equivalents"
    $converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
    $data = ereg_replace("[^A-Za-z0-9,\.']", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
    return $data; // Hand the cleaned string back to the caller
}
$clean = english_only($stringtoclean);
?>
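A note for anyone finding this thread later: ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7.0, so the function above will not run on current PHP. Below is a sketch of the same idea using preg_replace(). The name english_only_preg is made up to avoid clashing with the function above, and the script and data are assumed to be UTF-8.

```php
<?php
// Variant of english_only() from this thread, using PCRE instead of ereg.
function english_only_preg($data)
{
    $invalid = array("À", "Æ", "Ç"); // Extend both arrays together, in the same order
    $valid   = array("A", "AE", "C");
    $converted = str_replace($invalid, $valid, $data);

    // Drop everything except letters, digits, commas, periods and apostrophes.
    return preg_replace("/[^A-Za-z0-9,.']/", "", $converted);
}

echo english_only_preg("Æther, it's Ç."), "\n"; // AEther,it'sC.
```

As the output shows, the pattern strips spaces too, just like the patterns earlier in the thread. Add a space inside the brackets if you want to keep them.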