[help]Curl giving me ????? characters [encoding issue]

Discussion in 'PHP' started by linkinpark2014, Jan 28, 2009.

  1. #1
    Hi,

    I'm using the cURL library to grab the content of an Arabic website, but when I echo the data from that website, all the characters show up as ?????.

    After checking the website's encoding, I found that it uses windows-1256 (Arabic).

    So when cURL grabs the content, the data seems to get converted to unknown characters automatically, and when I print it every character comes out as ????????.
    But if I change the browser encoding, I get the correct text.

    So my question: is there any way to convert windows-1256 encoding to UTF-8 with PHP?
     
    linkinpark2014, Jan 28, 2009 IP
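
One possible answer to the question above, as a minimal sketch: PHP's iconv() function can usually convert windows-1256 bytes to UTF-8. This assumes the iconv extension is enabled and that the underlying iconv library recognizes the charset name 'WINDOWS-1256' (glibc and GNU libiconv do; other builds may expect 'CP1256'). The helper name to_utf8_from_1256() is made up for this example.

```php
<?php
// Hypothetical helper: convert a windows-1256 byte string to UTF-8.
// Assumes the iconv extension is loaded and accepts the name "WINDOWS-1256".
function to_utf8_from_1256($bytes)
{
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $bytes);
    if ($utf8 === false) {
        // iconv() returns false when the input cannot be converted
        return null;
    }
    return $utf8;
}

// The five bytes below are the word "روابط" as encoded in windows-1256.
$raw = "\xD1\xE6\xC7\xC8\xD8";
echo to_utf8_from_1256($raw), "\n";
```

In the original script, the same call would be applied to the string returned by get_content() before echoing or matching it.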
  2. yoavmatchulsky

    #2
    Try adding header('Content-Type: text/html; charset=UTF-8');
     
    yoavmatchulsky, Jan 28, 2009 IP
  3. linkinpark2014

    #3
    PHP doesn't support the windows-1256 format...
     
    linkinpark2014, Jan 28, 2009 IP
  4. yoavmatchulsky

    #4
    You just said that if you change your browser's encoding to UTF-8, it displays fine. That means the content you're receiving is in UTF-8, but the page you're viewing thinks it's windows-1256.

    What header() does is send an HTTP header line before any text is sent, so your browser knows to display the page as UTF-8.
     
    yoavmatchulsky, Jan 28, 2009 IP
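
To make the header() suggestion concrete, here is a small sketch. The charset named in the Content-Type header has to agree with the bytes that are actually echoed, so there are two consistent options: send the raw bytes and declare windows-1256, or convert to UTF-8 first and declare UTF-8. The function name content_type_header() and the sample bytes are assumptions for illustration only, and the iconv charset name 'WINDOWS-1256' may need to be 'CP1256' on some builds.

```php
<?php
// Hypothetical helper: build a Content-Type header line for an HTML page.
function content_type_header($charset)
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Sample bytes: the Arabic word "روابط" in windows-1256 (a stand-in for the fetched page).
$raw = "\xD1\xE6\xC7\xC8\xD8";

// Option A: pass the bytes through untouched and label them as windows-1256.
// Option B (used here): convert to UTF-8 with iconv() and label the output as UTF-8.
$body = iconv('WINDOWS-1256', 'UTF-8', $raw);

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}
echo $body, "\n";
```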
  5. linkinpark2014

    #5
    The website uses windows-1256; however, when cURL grabs the content, it treats the data as UTF-8, even though it is originally windows-1256.
    So what I want is to convert the UTF-8 back to windows-1256.

    Here is the source:
    
    <?php
    // Fetch a windows-1256 (Arabic) page with cURL and search it for an Arabic word.
    // Problem: echoing the fetched data shows ????? unless the browser encoding is changed.

    $url2 = "http://forum.kooora.com/f.aspx?mode=f&f=169";

    function get_content($url)
    {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 1); // include the response headers in the returned string

        // CURLOPT_HTTPHEADER expects one array element per header line, without a trailing "\r\n"
        $headers = array(
            "Accept-Language: en-us,en;q=0.5",
            "Accept-Charset: windows-1256;q=0.7,*;q=0.7",
            "Keep-Alive: 300",
            "Connection: keep-alive",
        );
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
        curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // cookies are written here on close
        curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // saved cookies are read from here
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the response instead of printing it

        $string = curl_exec($ch);
        curl_close($ch);

        return $string;
    }

    $content = get_content($url2);

    // This pattern should capture the fields around the word "روابط".
    // The literal "روابط" here is already UTF-8, but preg_match_all() finds no match
    // against the fetched page, even though the word is clearly visible many times
    // when I browse the site manually.
    $pattern = '/"ftnh",(.*?),(.*?)(روابط)(.*?),/';

    if (preg_match_all($pattern, $content, $out, PREG_PATTERN_ORDER)) {
        echo "matched";
        print_r($out);
    } else {
        echo "no match";
    }

    ?>
    
    linkinpark2014, Jan 28, 2009 IP
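
A possible explanation for the failed match in the script above, offered as a sketch rather than a confirmed diagnosis: cURL returns the page bytes unchanged, so the fetched text is still windows-1256 while the pattern literal is UTF-8, and the two byte sequences never line up. Converting the page to UTF-8 before matching (and adding the /u modifier) makes them comparable. The sample string and the helper name match_links() are invented for this example; the real page markup may differ.

```php
<?php
// Hypothetical helper: convert fetched windows-1256 bytes to UTF-8, then match.
// Assumes the iconv extension is available and this file is saved as UTF-8.
function match_links($rawPage)
{
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $rawPage);
    $pattern = '/"ftnh",(.*?),(.*?)(روابط)(.*?),/u'; // /u: treat pattern and subject as UTF-8
    preg_match_all($pattern, $utf8, $out, PREG_PATTERN_ORDER);
    return $out;
}

// Invented sample line, encoded to windows-1256 to imitate what cURL would return.
$sample = iconv('UTF-8', 'WINDOWS-1256', '"ftnh",12,"روابط مفيدة",');

// Matching the raw bytes against the UTF-8 pattern finds nothing...
$rawHits = preg_match_all('/"ftnh",(.*?),(.*?)(روابط)(.*?),/', $sample, $ignored);

// ...while converting first lets the word be found.
$out = match_links($sample);
echo "raw hits: $rawHits, converted hits: ", count($out[0]), "\n";
```

The reverse approach, converting the pattern to windows-1256 and matching the raw bytes without /u, should also work, but keeping everything in UTF-8 tends to be easier to debug.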