[help]Curl giving me ????? characters [encoding issue]

linkinpark2014 Peon

#1

Hi,

I'm using the curl library to fetch the content of an Arabic website, but when I echo the data, every character shows up as ?????.

I checked the site's encoding and found that it uses windows-1256 (Arabic).

It looks as though curl converts the data into unknown characters, so when I print it I get ???????? everywhere. If I change the browser's encoding, though, the text displays correctly.

So my question is: is there any way to convert windows-1256 into UTF-8 with PHP?
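One common way to do this kind of conversion is the iconv extension, which on most builds knows the windows-1256 code page. A minimal sketch, assuming iconv is enabled; the function name cp1256_to_utf8 and the sample bytes are only for illustration:

```php
<?php
// Hypothetical helper: convert windows-1256 bytes to UTF-8 with iconv.
// Assumes the iconv extension is available and its library knows WINDOWS-1256.
function cp1256_to_utf8(string $bytes): string
{
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $utf8;
}

// "روابط" as windows-1256 bytes (one byte per letter).
$raw = "\xD1\xE6\xC7\xC8\xD8";
echo cp1256_to_utf8($raw), "\n";
```

The output is only readable if the page or terminal showing it is itself set to UTF-8.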

linkinpark2014, Jan 28, 2009 IP

yoavmatchulsky Member

#2

Try adding header('Content-Type: text/html; charset=UTF-8'); before you output anything.

yoavmatchulsky, Jan 28, 2009 IP

linkinpark2014 Peon

#3

PHP doesn't support the windows-1256 format...

linkinpark2014, Jan 28, 2009 IP

yoavmatchulsky Member

#4

You just said that if you change your browser's encoding to UTF-8 it displays fine. That means the content you are receiving is in UTF-8, but the page you are viewing thinks it's windows-1256.

What header() does is send an HTTP header line before you output any text, so your browser knows to display the page as UTF-8.
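As a rough illustration of the point about header(): the charset has to be sent before any output, and it should name the encoding the bytes are really in. The helper name content_type_header is made up for this sketch, and whether the right charset is UTF-8 or windows-1256 depends on what the fetched page actually contains:

```php
<?php
// Hypothetical helper: build the Content-Type header line for a given charset.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only takes effect if nothing has been echoed yet.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}

// Body output starts only after the header has been queued.
echo "<p>page body goes here</p>\n";
```

If the echoed bytes are still windows-1256, passing 'windows-1256' to the helper instead would be the matching declaration.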

yoavmatchulsky, Jan 28, 2009 IP

linkinpark2014 Peon

#5

The website uses windows-1256. When curl grabs the content, however, it treats the data as UTF-8, even though it is originally windows-1256.
So what I want is to convert the UTF-8 back to windows-1256.
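For what it's worth, curl normally hands back the bytes as the server sent them, so one option is to leave the page alone and convert the search word to windows-1256 instead. A small sketch, assuming the iconv extension is available; $page is a made-up stand-in for the fetched data, not real output from the site:

```php
<?php
// The search word "روابط", written as explicit UTF-8 bytes so the example
// does not depend on how this file is saved.
$needleUtf8 = "\xD8\xB1\xD9\x88\xD8\xA7\xD8\xA8\xD8\xB7";

// Convert the needle to windows-1256 so it can be compared with raw page bytes.
$needleCp1256 = iconv('UTF-8', 'WINDOWS-1256', $needleUtf8);

// Stand-in for fetched content: ASCII text around the word in windows-1256.
$page = "\"ftnh\",1,\xD1\xE6\xC7\xC8\xD8 2,";

// Byte-wise search: works because both sides now use the same encoding.
$found = ($needleCp1256 !== false) && (strpos($page, $needleCp1256) !== false);
echo $found ? "found\n" : "not found\n";
```

Converting the needle keeps the page bytes untouched, which may be handy if the data is passed on to something that expects windows-1256.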

Here is the source:


<?php
// Problem: the target site is served as windows-1256 (Arabic), and the fetched
// text prints as ????? unless the browser encoding is changed by hand.

$url2 = "http://forum.kooora.com/f.aspx?mode=f&f=169";

// Fetch a URL with curl and return the raw response (headers + body) as a string.
function get_content($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // include response headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it

    // CURLOPT_HTTPHEADER expects one array element per header line.
    $headers = array(
        "Accept-Language: en-us,en;q=0.5",
        "Accept-Charset: windows-1256;q=0.7,*;q=0.7",
        "Keep-Alive: 300",
        "Connection: keep-alive",
    );
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // saved cookies

    $string = curl_exec($ch);
    curl_close($ch);

    return $string;
}

$content = get_content($url2);

// This pattern should capture the fields around the word "روابط".
// The word is stored here as UTF-8 (this file is saved as UTF-8), but
// preg_match_all() finds nothing in the fetched content, even though
// I can clearly read many occurrences of it when I browse the page manually.
$pattern = '/"ftnh",(.*?),(.*?)(روابط)(.*?),/';

if (preg_match_all($pattern, $content, $out, PREG_PATTERN_ORDER)) {
    echo "matched";
    print_r($out);
} else {
    echo "no match";
}

?>
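One way the script above could be adapted, sketched under a few assumptions: the iconv extension is installed, the page really is windows-1256 (read from the Content-Type response header when present, otherwise taken as a default), and only the body is fetched, without the response headers. The function names charset_from_content_type, to_utf8 and fetch_body are made up for this example:

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type value,
// falling back to a default when none is declared.
function charset_from_content_type(?string $contentType, string $default = 'WINDOWS-1256'): string
{
    if ($contentType !== null && preg_match('/charset=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: convert bytes in $charset to UTF-8 (requires iconv).
function to_utf8(string $bytes, string $charset): string
{
    if ($charset === 'UTF-8') {
        return $bytes;
    }
    $utf8 = iconv($charset, 'UTF-8', $bytes);
    return $utf8 === false ? '' : $utf8;
}

// Fetch the body only and return it as UTF-8 (empty string on failure).
function fetch_body(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    if ($body === false) {
        return '';
    }
    return to_utf8($body, charset_from_content_type($type ?: null));
}

// Usage (needs network access):
//   $content = fetch_body($url2);
//   preg_match_all('/"ftnh",(.*?),(.*?)(روابط)(.*?),/u', $content, $out);
```

The /u modifier in the usage lines asks PCRE to treat both the pattern and the subject as UTF-8, which only makes sense once the content has been converted and the script file itself is saved as UTF-8.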


linkinpark2014, Jan 28, 2009 IP
