Hello there! I have a question that has stumped me all day (and probably other people as well). I'm using PHP with cURL to load the source code of Wikipedia articles. This works in general, but it fails for URLs that include non-ASCII characters, like the following:

$URL=utf8_encode("http://en.wikipedia.org/wiki/Heinz_Günthardt");
echo "<br><br>Scanning: " . $URL . " ->";
$ch = curl_init($URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$result=curl_exec($ch);
echo $result;

When I get the result on the page, it puts a ? in place of the ü. I've looked all over the PHP documentation at functions like urlencode, htmlentities and so on, but I can't find anything that makes the string $URL retain the character ü. It could be any other character too; I'm just using ü as an example. If anyone can give me an answer, or at least a hint in the right direction, I would be extremely thankful! I don't mind paying for help with a solution either (I can afford $20 if you desire it). Best, - Joe
Hmmm, this is quite a good question. So in your case $result is the contents of the webpage, but when you echo it the output is encoded incorrectly? PHP defaults to ISO-8859-1 rather than UTF-8, so that's the first problem. I would say that your answer lies somewhere in the iconv module: http://www.php.net/iconv . The catch is that the iconv module may not be enabled on your server, in which case you will need to install it yourself or, if you don't have a dedicated box, ask your hosting provider to do so. You could try setting something like this at the top of your script:

iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");

Although I haven't actually tested it, I'd say you can definitely solve the issue using something like that. Buy yourself a nice lunch with the $20.
*EDIT* What check can I perform to see if it's installed? What do I do if installing it is beyond my control? Unfortunately no luck. My script looks like this now:

iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
$URL="http://en.wikipedia.org/wiki/Heinz_Günthardt";
echo "<br><br>Scanning: " . $URL . " ->";
$ch = curl_init($URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$result=curl_exec($ch);
echo $result;

The output is below. Note that I put a * in the http part so this site doesn't think it's a link.

Scanning: ht*p://en.wikipedia.org/wiki/Heinz_G�nthardt ->URL passes Check ->
HTTP/1.0 301 Moved Permanently
Date: Thu, 06 Dec 2007 23:12:23 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Vary: Accept-Encoding,Cookie
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Thu, 06 Dec 2007 23:12:23 GMT
Location: http://en.wikipedia.org/wiki/Heinz_GŸnthardt
Content-Length: 0
Content-Type: text/html; charset=utf-8
Age: 232
X-Cache: HIT from sq19.wikimedia.org
X-Cache-Lookup: HIT from sq19.wikimedia.org:3128
X-Cache: MISS from sq23.wikimedia.org
X-Cache-Lookup: MISS from sq23.wikimedia.org:80
Via: 1.0 sq19.wikimedia.org:3128 (squid/2.6.STABLE16), 1.0 sq23.wikimedia.org:80 (squid/2.6.STABLE16)
Connection: close

As you can see, I still get that ?, which means that cURL still goes to the incorrect address. Any other ideas?
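On the question of how to check whether iconv is installed: a minimal sketch, assuming you can run a small PHP file on the server in question. The helper name iconv_available() is made up for illustration; extension_loaded() and function_exists() are standard PHP functions.

```php
<?php
// Hypothetical helper: report whether the iconv extension is usable
// in the PHP build that executes this script.
function iconv_available(): bool
{
    return extension_loaded('iconv') && function_exists('iconv');
}

if (iconv_available()) {
    echo "iconv is available\n";
} else {
    echo "iconv is not available; ask your host whether it can be enabled\n";
}
```

Calling phpinfo() and looking for an "iconv" section in its output is another way to get the same answer.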
Usually there should be no problem. Where does the URL come from? Is it maybe already in UTF-8? If so, try applying utf8_decode() first. Or try:

$url = "http://en.wikipedia.org/wiki/" . rawurlencode("Heinz_Günthardt");

The question mark is output by your browser, and it might appear because you're not telling the browser that the content is UTF-8 encoded. It doesn't necessarily mean the URL is wrong when you pass it to cURL. Do you still see the question mark if you add this line above?

header('Content-Type: text/html; charset=utf-8');

Your first code above works for me when I take out the utf8_encode().
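To illustrate why the encoding of the string matters before it reaches rawurlencode(): the function percent-encodes whatever bytes it is given, so the same visible character produces different URLs depending on how the script file was saved. The sketch below writes the bytes out explicitly so that it does not depend on the editor's encoding; the variable names are only for illustration, and it assumes (as the Content-Type header in the response above suggests) that Wikipedia wants the UTF-8 form.

```php
<?php
// "\xC3\xBC" is the UTF-8 byte sequence for ü; "\xFC" is the ISO-8859-1 byte.
$utf8Title   = "Heinz_G\xC3\xBCnthardt";
$latin1Title = "Heinz_G\xFCnthardt";

// rawurlencode() percent-encodes each non-ASCII byte exactly as it finds it.
$fromUtf8   = rawurlencode($utf8Title);   // two escaped bytes for ü
$fromLatin1 = rawurlencode($latin1Title); // one escaped byte for ü

echo "http://en.wikipedia.org/wiki/" . $fromUtf8 . "\n";
echo "http://en.wikipedia.org/wiki/" . $fromLatin1 . "\n";
```

If the second form (or some other single-byte form) shows up in your output, the string was not UTF-8 at the point where it was encoded.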
Can you show me the exact code you used to get it to work? It still doesn't work for me. I know it's not working because I echo the result of the cURL call and it displays the wrong item, and I know that the page really exists because it loads when I type the URL normally in the browser. You can see the results of my script here. Obviously this is the full version of the script, but the parts I showed above should be all that really matters. I can show all the code in a PM if you need it, because it is quite long. http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

*EDIT* I temporarily changed the $URL= line to the one below:

$URL="http://en.wikipedia.org/wiki/".rawurlencode("Heinz_Günthardt");

It just put %9 instead of the ?, and Wikipedia didn't return anything, or else it would have shown up in the echo $result; (I know this part works because I can test it with Google, or with another Wikipedia article that has no international characters in it).
Okay, after some experimentation and headaches I'm getting closer to figuring this out. My scripts would work on other people's servers but not on my own, so it seems libcurl might not have been compiled with proper UTF-8 support on mine.

<?php
$URL=utf8_encode("http://en.wikipedia.org/wiki/Heinz_Günthardt");
$parts = parse_url($URL);
// Percent-encode the path but keep the slashes; append the query string only if there is one.
$URL = $parts['scheme'].'://'.$parts['host'].str_replace('%2F','/',urlencode($parts['path'])).(isset($parts['query'])?'?'.$parts['query']:'');
echo "<br><br>Scanning: " . $URL . " ->";
$ch = curl_init($URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$result=curl_exec($ch);
echo $result;
?>

I should also note that the editor I was working in, BBEdit, might have corrupted the values while I was testing, which complicates things even further. This is the code for the following URL: http://dnfinder.net/rentacoder/curltest.php. So while it works now, it's not completely perfect; there are still some characters that don't come through correctly. On my multiple international character test on the page below

http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

the output looks different on Macs and PCs, but the URL is at least going to the right place (or I hope so, anyway). I just hope that the data I grab, which might have international characters in it, doesn't get corrupted when I save it, which is the next step. I'll keep this updated if anyone has any comments.
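For anyone adapting the approach above, here is a variant sketched under a few assumptions: input that is not already valid UTF-8 is treated as ISO-8859-1, the URL is not percent-encoded yet (otherwise the % signs would be encoded a second time), and the URL is split with a regular expression rather than parse_url(). The function names latin1_to_utf8(), to_utf8() and encode_url_path() are made up for this example. It uses rawurlencode() per path segment, so spaces become %20 rather than +.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 without any extension.
function latin1_to_utf8(string $text): string
{
    $out = preg_replace_callback('/[\x80-\xFF]/', function (array $m): string {
        $byte = ord($m[0]);
        // Every Latin-1 byte >= 0x80 becomes a two-byte UTF-8 sequence.
        return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }, $text);
    return $out ?? $text;
}

// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function to_utf8(string $text): string
{
    // With the /u modifier, preg_match() does not return 1 for invalid UTF-8.
    return preg_match('//u', $text) === 1 ? $text : latin1_to_utf8($text);
}

// Hypothetical helper: percent-encode each path segment, keeping "/" and the query.
function encode_url_path(string $url): string
{
    $url = to_utf8($url);
    // Groups: scheme plus authority, path, then everything from "?" or "#" on.
    if (!preg_match('~^([a-z][a-z0-9+.\-]*://[^/?#]+)([^?#]*)(.*)$~i', $url, $m)) {
        return $url; // not an absolute URL; return it unchanged
    }
    $segments = array_map('rawurlencode', explode('/', $m[2]));
    return $m[1] . implode('/', $segments) . $m[3];
}

echo encode_url_path("http://en.wikipedia.org/wiki/Heinz_G\xC3\xBCnthardt"), "\n";
```

The resulting string can be passed to curl_init() in place of $URL in the script above. The guess that non-UTF-8 input is ISO-8859-1 is only that, a guess; a file saved in another legacy encoding would need a different conversion.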
OK, the code I added above, plus this line:

header('Content-Type: text/html; charset=utf-8');

finished the job. Now it works =) Please keep this thread so others who have this problem can find it. Thanks!