Having trouble using regexp to find links in a page

Discussion in 'PHP' started by reza1217, Oct 24, 2007.

  1. #1
    Hi everyone, I would appreciate it if someone could help me with the following problem. I am trying to use the following code to extract links from a page:

    
    # Double quotes are needed so that $input_file is interpolated into the message
    $input = @file_get_contents($input_file) or die("Could not access file: $input_file");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $input, $matches))
    {
        # $matches[2] = array of link addresses
        # $matches[3] = array of link text - including HTML code
    }
    Everything works fine except that non-ASCII characters are not copied properly. For example, the following code
    <a href="get.php?d=07/10/24/w/p_tkytx">cÖ_g cvZv</a>
    is identified as
    <a href="get.php?d=07/10/24/w/p_tkytx">c�_g cvZv</a>
    although I see a ? instead of � here. I am sure this is a character encoding problem. The page I am working on is in a foreign language.
    Can someone help me copy the exact code as shown in the second code box?
     
    reza1217, Oct 24, 2007
  2. nabil_kadimi
    #2
    Maybe you can try the mb_ereg() function; I'm not sure.
     
    nabil_kadimi, Oct 24, 2007
  3. nico_swd
    #3
    Try using the UTF-8 modifier, the lowercase u.
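    A minimal sketch of that suggestion, assuming the page bytes are already valid UTF-8. The function name extract_links and both sample strings are hypothetical and only there for illustration; the pattern is the one from post #1 with "u" appended to the modifiers. Note that with "u" PCRE refuses a subject that is not valid UTF-8, so preg_match_all() returns false instead of matching.

```php
<?php
// Sketch only: the pattern from post #1 with the "u" (UTF-8) modifier added.
// extract_links() is a hypothetical helper name, not part of the original code.
function extract_links($html)
{
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

    // "u" makes PCRE treat both the pattern and the subject as UTF-8.
    if (preg_match_all("/$regexp/siUu", $html, $matches) === false) {
        return false; // e.g. the subject is not valid UTF-8
    }

    // $matches[2] = link addresses, $matches[3] = link text
    return array('href' => $matches[2], 'text' => $matches[3]);
}

// Hypothetical sample: "\xC3\x96" is the UTF-8 encoding of "Ö".
$utf8 = "<a href=\"get.php?d=07/10/24/w/p_tkytx\">c\xC3\x96_g cvZv</a>";
$links = extract_links($utf8);

// The same link with "Ö" stored as the single Windows-1252 byte 0xD6,
// which is not valid UTF-8, so the call fails under the "u" modifier.
$cp1252 = "<a href=\"get.php?d=07/10/24/w/p_tkytx\">c\xD6_g cvZv</a>";
$failed = extract_links($cp1252);
```

    If the call returns false, preg_last_error() can be compared against PREG_BAD_UTF8_ERROR to confirm that the input encoding, rather than the pattern, is the cause.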
     
    nico_swd, Oct 24, 2007
  4. reza1217
    #4
    I figured it out: the charset declared on the page was causing the problem. I had to use
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    instead of
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    Thank you for your help.
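    For anyone who needs to keep the output page in UTF-8, a possible alternative (not what was done in this thread) is to convert the extracted text from Windows-1252 to UTF-8 before printing it. This is a sketch under the assumption that the source page really is Windows-1252; the helper name cp1252_to_utf8 and the sample string are hypothetical.

```php
<?php
// Sketch only: re-encode Windows-1252 bytes as UTF-8 using the iconv extension.
// cp1252_to_utf8() is a hypothetical helper name used for illustration.
function cp1252_to_utf8($text)
{
    // iconv(from_encoding, to_encoding, string) returns false on failure.
    return iconv('Windows-1252', 'UTF-8', $text);
}

// Hypothetical link text as it might be read from a Windows-1252 page:
// the byte 0xD6 is "Ö" in that code page.
$link_text = "c\xD6_g cvZv";

// After conversion, "Ö" is the two-byte UTF-8 sequence 0xC3 0x96.
$converted = cp1252_to_utf8($link_text);
```

    The converted string can then be echoed on a page that declares charset=utf-8, while the regular expression itself stays unchanged.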
     
    reza1217, Oct 24, 2007