Need data to contain only standard English characters

Discussion in 'PHP' started by kkibak, May 31, 2007.

  1. #1
    I'm exporting some data (using php) to a csv file for an import into another system, and I'm running into trouble because some foreign characters in my database are disallowed by the service I need to import to.

    So what I want to do is (ideally) convert all non-US-English characters, such as Æ, to their nearest equivalent: for this example, AE.

    If that's not possible, I need to simply remove all non-English characters altogether.

    Any help would be greatly appreciated.
     
    kkibak, May 31, 2007 IP
  2. nico_swd

    nico_swd Prominent Member

    Messages:
    4,153
    Likes Received:
    344
    Best Answers:
    18
    Trophy Points:
    375
    #2
    nico_swd, May 31, 2007 IP
    kkibak likes this.
  3. dp-user-1

    dp-user-1 Well-Known Member

    Messages:
    794
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    110
    #3
    Try this:
    <?php
    	$data = "Original Data..."; // Your original data
    	$invalid = array("&#192;", "&#198;", "&#199;"); // À, Æ and Ç - add as many as you need using HTML character codes
    	$valid = array("A", "AE", "C"); // "Equivalents"
    	$converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
    	// preg_replace() is used here because ereg_replace() was removed in PHP 7.0
    	$data = preg_replace("/[^A-Za-z0-9]/", "", $converted); // Removes the remaining non-alphanumeric characters
    ?>
    PHP:
     
    dp-user-1, May 31, 2007 IP
    kkibak likes this.
  4. kkibak

    kkibak Peon

    Messages:
    1,083
    Likes Received:
    78
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Thanks for the replies guys, rep left for both. My problem is that I'm working with a huge database (literally over 100,000 rows, with a ton of data in each row), so there could be a ton of these foreign characters. It's not just one or two I need to eliminate, but all non-English chars. Do I have to go through and do this replace stuff for every single one? Surely there's a function out there that already does this?
     
    kkibak, May 31, 2007 IP
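
On the question of a ready-made function: PHP's iconv() accepts a //TRANSLIT suffix that asks the underlying library to approximate characters it cannot represent in the target charset. A minimal sketch, assuming the data is UTF-8. The helper name to_ascii_translit is made up for illustration, and what Æ turns into (AE, ?, or nothing) depends on the iconv implementation and locale on the server, so check it against real rows before relying on it.

```php
<?php
// Hypothetical helper name; not from the thread. Assumes $text is UTF-8.
function to_ascii_translit(string $text): string
{
    if (function_exists('iconv')) {
        // //TRANSLIT asks iconv to approximate characters that have no ASCII form.
        // The result is platform-dependent. iconv() returns false (with a notice)
        // on input it cannot handle, so the notice is suppressed and false is checked.
        $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }
    // Safety net: delete every byte outside printable ASCII (0x20-0x7E).
    // Note that this also removes tabs and newlines.
    return preg_replace('/[^\x20-\x7E]/', '', $text);
}

echo to_ascii_translit("Æsop, Ça va"), "\n"; // exact output varies by platform
```

Whatever iconv does with a given character, the final preg_replace() guarantees the result contains only printable ASCII.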
  5. dp-user-1

    #5
    Well, the code I gave you replaces only the characters you list in $invalid. Everything else that isn't A-Z, a-z or 0-9 is simply removed.
     
    dp-user-1, May 31, 2007 IP
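
To make the two-step behaviour concrete, here is a small sketch of "replace what is listed, strip the rest". It uses strtr() with a lookup array and preg_replace() rather than the exact calls posted in the thread, the three-entry map is only a sample, and it assumes the script file and the data are both UTF-8.

```php
<?php
// Hypothetical demo; the map is a tiny sample, not a complete list.
$map = ['À' => 'A', 'Æ' => 'AE', 'Ç' => 'C'];

$input = 'ÀÆÇ and Ñ';

// Step 1: swap the listed characters for their ASCII stand-ins.
$converted = strtr($input, $map); // "AAEC and Ñ"

// Step 2: delete everything that is still not A-Z, a-z or 0-9.
// Ñ is not in the map, so it is dropped here, and so are the spaces.
$clean = preg_replace('/[^A-Za-z0-9]/', '', $converted);

echo $clean, "\n"; // AAECand
```

The output shows a side effect worth knowing about: with only letters and digits allowed, spaces are stripped as well, so any character that should survive has to be added to the bracket expression.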
  6. nico_swd

    #6
    Very important: Make a backup of your database before applying any code on the data. I guess you knew that already, but I can't repeat it often enough.
     
    nico_swd, May 31, 2007 IP
  7. dp-user-1

    #7
    Good point, nico_swd.
     
    dp-user-1, May 31, 2007 IP
  8. kkibak

    #8
    Ah okay! I didn't look carefully enough :) Just saw str_replace and the few conversions and thought I knew what you had in mind. I'll give this a whirl.
     
    kkibak, May 31, 2007 IP
  9. kkibak

    #9
    Thanks nico, actually not necessary though because I'm just going to implement this before the export to csv, not on the incoming data.
     
    kkibak, May 31, 2007 IP
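
Cleaning at export time, as described above, could look roughly like this. It is a sketch under assumptions: clean_field and the sample rows are invented for illustration, php://memory stands in for the real CSV file, and a real export would loop over database results instead of a hard-coded array.

```php
<?php
// Hypothetical cleaner and sample rows; not taken from the thread.
function clean_field(string $value): string
{
    // Map a few known characters, then keep only letters, digits,
    // commas, periods, apostrophes and spaces.
    $value = strtr($value, ['À' => 'A', 'Æ' => 'AE', 'Ç' => 'C']);
    return preg_replace("/[^A-Za-z0-9,.' ]/", '', $value);
}

$rows = [
    ['id' => '1', 'name' => 'Ænima Records'],
    ['id' => '2', 'name' => 'Ça Va, Inc.'],
];

$out = fopen('php://memory', 'w+'); // stand-in for the real CSV file
foreach ($rows as $row) {
    // Clean a copy of each field on its way out; the stored data is never modified.
    fputcsv($out, array_map('clean_field', $row), ',', '"', '\\');
}

rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

Because only the exported copy is changed, the original rows stay intact, which is why a backup is less of a concern with this approach.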
  10. kkibak

    #10
    One last thing--I actually need to keep commas, periods & apostrophes. I suck at regex so would you mind helpin' me with that?
     
    kkibak, May 31, 2007 IP
  11. ansi

    ansi Well-Known Member

    Messages:
    1,483
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    100
    #11
    In the last line (the one that removes the remaining non-alphanumeric characters), change the bracket expression from

    [^A-Za-z0-9]

    to

    [^A-Za-z0-9,\.']

    so that commas, periods and apostrophes are kept as well.
     
    ansi, May 31, 2007 IP
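
If the list of punctuation to keep is likely to change, the bracket expression can be built from a parameter instead of being edited by hand. A rough sketch, assuming preg_replace() is used; strip_except and its $extra parameter are hypothetical names, not something from the thread.

```php
<?php
// Hypothetical helper: $extra lists the characters to keep
// in addition to A-Z, a-z and 0-9.
function strip_except(string $text, string $extra = ",.'"): string
{
    // preg_quote() escapes any character in $extra that is special in a regex;
    // '/' is passed as the delimiter so it gets escaped too.
    $class = 'A-Za-z0-9' . preg_quote($extra, '/');
    return preg_replace('/[^' . $class . ']/', '', $text);
}

echo strip_except("O'Neil, J. (2007)!"), "\n";         // O'Neil,J.2007
echo strip_except("O'Neil, J. (2007)!", ",.' "), "\n"; // O'Neil, J. 2007
```

The second call adds a space to the allowed set, which is usually what a name or address column needs.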
  12. dp-user-1

    #12
    That makes the final code:
    <?php
    	$data = "Original Data..."; // Your original data
    	$invalid = array("&#192;", "&#198;", "&#199;"); // À, Æ and Ç - add as many as you need using HTML character codes
    	$valid = array("A", "AE", "C"); // "Equivalents"
    	$converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
    	// preg_replace() is used here because ereg_replace() was removed in PHP 7.0
    	$data = preg_replace("/[^A-Za-z0-9,\.']/", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
    ?>
    PHP:
     
    dp-user-1, May 31, 2007 IP
  13. dp-user-1

    #13
    Let us know if it works. :)
     
    dp-user-1, May 31, 2007 IP
  14. kkibak

    #14
    Worked like a charm, thanks so much for the help.

    Not that this is much of a contribution on my part, but I put it together as a function thinking that if we post it here, people can add conversions to it over time and it could gradually improve.

    So if anyone does add more conversions to the function, please reply with your code so we can all benefit :).

    
    <?php
    
    # Please contribute by adding non-English characters to the $invalid array and corresponding English chars to the $valid array
    
    function english_only($data) {
        $data = $param; // Your original data
        $invalid = array("À", "Æ", "Ç"); // À, Æ and Ç - add as many as you need
        $valid = array("A", "AE", "C"); // "Equivalents"
        $converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
        // preg_replace() is used here because ereg_replace() was removed in PHP 7.0
        return preg_replace("/[^A-Za-z0-9\.']/", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
    }
    $cleaned = english_only($stringtoclean); // The function returns the cleaned string
    
    ?>
    
    Code (markup):
     
    kkibak, May 31, 2007 IP
  15. ansi

    #15
    Remove the line "$data = $param; // Your original data". $data is already passed to the function as its parameter.
     
    ansi, May 31, 2007 IP
  16. kkibak

    #16
    Thanks for correcting that, I forgot to reverse some of the changes I'd made for my script. The function I posted here was also missing the comma in the regex.

    
    <?php
    
    # Please contribute by adding non-English characters to the $invalid array and corresponding English chars to the $valid array
    
    function english_only($data) {
        $invalid = array("À", "Æ", "Ç"); // À, Æ and Ç - add as many as you need
        $valid = array("A", "AE", "C"); // "Equivalents"
        $converted = str_replace($invalid, $valid, $data); // Replaces the characters with their "equivalents"
        // preg_replace() is used here because ereg_replace() was removed in PHP 7.0
        return preg_replace("/[^A-Za-z0-9,\.']/", "", $converted); // Removes the remaining non-alphanumeric characters, excluding: commas, periods and apostrophes
    }
    $cleaned = english_only($stringtoclean); // The function returns the cleaned string
    
    ?>
    
    Code (markup):
     
    kkibak, May 31, 2007 IP