How to go to URLs that have international characters in them? (UTF-8?)

Discussion in 'PHP' started by x11joex11, Dec 6, 2007.

  1. #1
    Hello there! I have a question that I've been stumped on all day (and probably other people as well).

    I'm trying to use PHP with cURL to load the source code of Wikipedia articles.

    While I have this working currently, it fails when I try to go to URLs that include UTF-8 characters, like the following.

    
    $URL = utf8_encode("http://en.wikipedia.org/wiki/Heinz_Günthardt");
    echo "<br><br>Scanning: " . $URL . " ->";

    $ch = curl_init($URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $result = curl_exec($ch);

    echo $result;

    When I get the result on the page, it puts a ? in place of the ü. I've looked all over PHP for functions like urlencode(), htmlentities() and so on, but I can't seem to find anything that will make the string $URL retain the character ü. It could be any other character too; I'm just using ü as an example.

    If anyone can give me an answer, or at least a hint in the right direction, I would be extremely thankful! I don't mind paying for a solution either (I can afford $20 if you desire it).

    Best,
    - Joe
     
    x11joex11, Dec 6, 2007 IP
  2. tonybogs

    #2
    Hmmm, this is quite a good question :)

    So in your case $result is the contents of the webpage, but when you echo it the output is encoded incorrectly?

    PHP defaults to ISO-8859-1 rather than UTF-8, so that's the first problem. I would say that your answer lies somewhere in the iconv module (http://www.php.net/iconv). One catch is that iconv may not be enabled on every build, so if it's missing you will need to install it yourself or, if you don't have a dedicated box, ask your hosting provider to do so.

    You could try setting something like this at the top of your script:

    iconv_set_encoding("internal_encoding", "UTF-8");
    iconv_set_encoding("output_encoding", "UTF-8");

    Although I haven't actually tested it, I'd say you can definitely solve the issue using something like that.

    Buy yourself a nice lunch with the $20 :)
     
    tonybogs, Dec 6, 2007 IP
  3. x11joex11

    #3
    *EDIT* What check can I perform to see if it's installed? What do I do if it's beyond my control to install it?
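
    On the question of checking whether iconv is installed, here is a minimal sketch using the standard extension_loaded() and function_exists() calls. The helper name has_iconv() is made up for illustration, and the mbstring fallback is only one possible alternative, not something suggested earlier in this thread. Running phpinfo() and looking for an "iconv" section is another way to see the same thing.

```php
<?php
// Report whether the iconv extension is available in this PHP build.
// has_iconv() is a hypothetical helper name used only for this sketch.
function has_iconv(): bool
{
    return extension_loaded('iconv') && function_exists('iconv');
}

if (has_iconv()) {
    echo "iconv is available\n";
} else {
    // Without iconv, another converter such as mbstring may still be present.
    echo extension_loaded('mbstring')
        ? "iconv is missing, but mbstring is available\n"
        : "neither iconv nor mbstring is available\n";
}
```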

    Unfortunately no luck, my script looks like this now.

    iconv_set_encoding("internal_encoding", "UTF-8");
    iconv_set_encoding("output_encoding", "UTF-8");
    
    $URL="http://en.wikipedia.org/wiki/Heinz_Günthardt";
    	
    echo "<br><br>Scanning: " . $URL . " ->";
        
    $ch = curl_init($URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $result=curl_exec($ch);
    
    echo $result;
    The output is shown below. Note that I put a * in the http part so this site doesn't think it's a link.

    Scanning: ht*p://en.wikipedia.org/wiki/Heinz_G�nthardt ->URL passes Check ->

    HTTP/1.0 301 Moved Permanently
    Date: Thu, 06 Dec 2007 23:12:23 GMT
    Server: Apache
    X-Powered-By: PHP/5.1.4
    Vary: Accept-Encoding,Cookie
    Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
    Last-Modified: Thu, 06 Dec 2007 23:12:23 GMT
    Location: http://en.wikipedia.org/wiki/Heinz_GŸnthardt
    Content-Length: 0
    Content-Type: text/html; charset=utf-8
    Age: 232
    X-Cache: HIT from sq19.wikimedia.org
    X-Cache-Lookup: HIT from sq19.wikimedia.org:3128
    X-Cache: MISS from sq23.wikimedia.org
    X-Cache-Lookup: MISS from sq23.wikimedia.org:80
    Via: 1.0 sq19.wikimedia.org:3128 (squid/2.6.STABLE16), 1.0 sq23.wikimedia.org:80 (squid/2.6.STABLE16)
    Connection: close

    As you can see, I still get that ?, which means that cURL still goes to the incorrect address. Any other ideas?
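
    Since the response is a 301 with a Location header, it can help to pull out the status code and redirect target and look at their raw bytes rather than at what the browser renders. The following is a rough sketch under stated assumptions: parse_redirect() is a hypothetical helper, and the sample header string (including the /wiki/Example address) is invented for the demonstration. cURL also has a CURLOPT_FOLLOWLOCATION option for following redirects automatically, which this sketch does not use.

```php
<?php
// Extract the status code and Location header from a raw HTTP header block,
// such as the one cURL returns when CURLOPT_HEADER is enabled.
// parse_redirect() is a hypothetical helper name used only for this sketch.
function parse_redirect(string $rawHeaders): array
{
    $status = null;
    $location = null;
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        if ($status === null && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        } elseif (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }
    return ['status' => $status, 'location' => $location];
}

// Invented sample response, standing in for the $result string from curl_exec().
$sample = "HTTP/1.0 301 Moved Permanently\r\n"
        . "Location: http://en.wikipedia.org/wiki/Example\r\n"
        . "Content-Length: 0\r\n\r\n";

$info = parse_redirect($sample);
echo $info['status'], ' -> ', $info['location'], "\n";
// bin2hex() shows the exact bytes of the target, independent of browser rendering.
echo bin2hex($info['location']), "\n";
```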
     
    x11joex11, Dec 6, 2007 IP
  4. nico_swd

    #4
    Usually there should be no problem.

    Where does the URL come from? Is it maybe already in UTF-8? If so, try applying utf8_decode() first.

    Or try:
    
    $url = "http://en.wikipedia.org/wiki/" . rawurlencode("Heinz_Günthardt");
    
    The output of the question mark is in your browser, and it might appear because you're not telling the browser that the content is UTF-8 encoded. It doesn't necessarily mean the URL is wrong when you pass it to cURL.

    Do you still see the question mark if you add this line above?
    
    header('Content-Type: text/html; charset=utf-8');
    
    Your first code above works for me when I take out the utf8_encode().
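
    One way to separate "the browser is displaying it wrong" from "the string itself is wrong" is to inspect the bytes directly. This is a small sketch, not a definitive test: is_valid_utf8() is a hypothetical helper that relies on PCRE's /u modifier rejecting malformed UTF-8, and the two sample strings are written with explicit byte escapes so the result does not depend on how this file is saved.

```php
<?php
// Return true when $s is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false for invalid UTF-8 subjects.
// is_valid_utf8() is a hypothetical helper name used only for this sketch.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$utf8   = "G\xC3\xBCnthardt"; // "ü" stored as the two UTF-8 bytes C3 BC
$latin1 = "G\xFCnthardt";     // "ü" stored as the single ISO-8859-1 byte FC

var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)
echo bin2hex($utf8), "\n";        // 47c3bc6e746861726474
```

    If the string turns out to be valid UTF-8, a ? on the page points at the output encoding (the Content-Type header) rather than at the URL.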
     
    nico_swd, Dec 6, 2007 IP
  5. x11joex11

    #5
    Can you show me the exact code you used to get it to work? It still doesn't work for me. I know it's not working because I echo the results of the cURL call and it displays the wrong item, and I know that the page really exists when I type the address normally in the browser.

    You can see the results of my script here. Obviously this is the full version of the script, but the parts I showed above should be all that really matters. I can show all the code in a PM if you need it, because it is quite long.

    http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    *EDIT* I temporarily changed the $URL= line to below

    $URL="http://en.wikipedia.org/wiki/".rawurlencode("Heinz_Günthardt");

    It just put %9 in place of the ?, and Wikipedia didn't return anything; otherwise it would have shown in the echo $result. (I know the script itself works because I can test it with Google, or with another Wikipedia article that has no international characters in it.)
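
    It may help to remember that rawurlencode() works on bytes, so its output depends on how the ü is actually stored in the string. The sketch below uses explicit byte escapes to show both cases; the variable names are illustrative, and the iconv() call assumes the iconv extension is loaded. Wikipedia article URLs use UTF-8, so the %C3%BC form is the one to aim for.

```php
<?php
// rawurlencode() percent-encodes bytes, so the result depends on how "ü" is stored.
$title_utf8   = "Heinz_G\xC3\xBCnthardt"; // "ü" as UTF-8 bytes C3 BC
$title_latin1 = "Heinz_G\xFCnthardt";     // "ü" as ISO-8859-1 byte FC

echo rawurlencode($title_utf8), "\n";   // Heinz_G%C3%BCnthardt
echo rawurlencode($title_latin1), "\n"; // Heinz_G%FCnthardt

// Converting ISO-8859-1 to UTF-8 first gives the UTF-8 form again.
// iconv() is used here instead of utf8_encode(), which is deprecated as of PHP 8.2.
$converted = iconv('ISO-8859-1', 'UTF-8', $title_latin1);
echo rawurlencode($converted), "\n";    // Heinz_G%C3%BCnthardt
```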
     
    x11joex11, Dec 6, 2007 IP
  6. x11joex11

    #6
    Okay, after some experimentation and headaches I'm getting closer to figuring this out. My scripts would work on other people's servers but not my own, so it seems libcurl might not have been compiled with proper UTF-8 support.

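    <?php
    // utf8_encode() assumes ISO-8859-1 input, so this file must be saved in that encoding.
    $URL = utf8_encode("http://en.wikipedia.org/wiki/Heinz_Günthardt");

    // Percent-encode the path only, keeping the "/" separators intact.
    // rawurlencode() is used so that spaces become %20 rather than "+".
    $parts = parse_url($URL);
    $URL = $parts['scheme'] . '://' . $parts['host']
         . str_replace('%2F', '/', rawurlencode($parts['path']))
         . (!empty($parts['query']) ? '?' . $parts['query'] : ''); // avoids an undefined-index notice

    echo "<br><br>Scanning: " . $URL . " ->";

    $ch = curl_init($URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, 1);         // include the response headers in the result
    $result = curl_exec($ch);

    echo $result;
    ?>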
    Also, I should note that the editor I was working in, BBEdit, might have corrupted the values when I was testing with it, which complicates things even further.

    This is the code for the following URL: http://dnfinder.net/rentacoder/curltest.php.

    So while it works now, it's not completely perfect: there are still some characters that don't come through correctly.

    On my multiple international character test on the page below

    http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    It appears that it looks different on Macs and PCs, but the URL is going to the right place at least (or I hope so, anyway).

    I just hope that the data I grab, which might have international characters in it, doesn't get corrupted when I save it, which is the next step.
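
    On the worry about saving: file_put_contents() writes a PHP string byte for byte without transcoding, so a quick round trip can confirm nothing changes on disk. This is only a sketch; save_bytes() is a hypothetical helper, the $page string is a stand-in for a real cURL result, and the temp file is created just for the check.

```php
<?php
// Write fetched bytes to disk unchanged and report whether every byte was written.
// save_bytes() is a hypothetical helper name used only for this sketch.
function save_bytes(string $path, string $data): bool
{
    // file_put_contents() writes the string byte for byte; it does not transcode.
    return file_put_contents($path, $data) === strlen($data);
}

$page = "<title>Heinz G\xC3\xBCnthardt</title>"; // stand-in for a cURL result
$file = tempnam(sys_get_temp_dir(), 'wiki');      // throwaway file for the round trip

// Read the file back and compare it with the original bytes.
$ok = save_bytes($file, $page) && file_get_contents($file) === $page;
echo $ok ? "round trip OK\n" : "bytes changed on disk\n";
unlink($file);
```

    Corruption is more likely to creep in later, for example when the file is opened in an editor or loaded into a database that assumes a different encoding.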

    I'll keep this updated if anyone has any comments.
     
    x11joex11, Dec 7, 2007 IP
  7. x11joex11

    #7
    OK, the code I added above, plus header('Content-Type: text/html; charset=utf-8');, finished the job. Now it works =). Keep this thread so others who have this problem can see it. Thanks!
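
    For anyone landing here later, the path-encoding step can be wrapped in a small function. This is a sketch under assumptions rather than the exact code from this thread: encode_url_path() is a hypothetical helper, it assumes the incoming URL is already UTF-8 (so no utf8_encode() step), and it splits the URL with a regular expression instead of parse_url() so the query string and fragment pass through untouched.

```php
<?php
// Percent-encode each path segment of a URL whose path is already UTF-8.
// encode_url_path() is a hypothetical helper name, not part of PHP or cURL.
function encode_url_path(string $url): string
{
    // $m[1] = scheme and host, $m[2] = path, $m[3] = query string and fragment.
    if (!preg_match('~^([a-z][a-z0-9+.-]*://[^/?#]+)([^?#]*)(.*)$~i', $url, $m)) {
        return $url; // leave anything unrecognised untouched
    }
    $segments = array_map(
        // rawurldecode() first, so an already-encoded segment is not encoded twice.
        function ($seg) { return rawurlencode(rawurldecode($seg)); },
        explode('/', $m[2])
    );
    return $m[1] . implode('/', $segments) . $m[3];
}

echo encode_url_path("http://en.wikipedia.org/wiki/Heinz_G\xC3\xBCnthardt"), "\n";
// http://en.wikipedia.org/wiki/Heinz_G%C3%BCnthardt
```

    The result can then be passed to curl_init() as in the earlier posts. Encoding per segment keeps the "/" separators as they are, which is what the str_replace('%2F', '/', ...) trick above achieves in a different way.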
     
    x11joex11, Dec 7, 2007 IP