having trouble using regexp to find links in a page

Discussion in 'PHP' started by reza1217, Oct 24, 2007.

  1. #1
    Hi everyone, I would appreciate it if someone could help me with the following problem. I am trying to use the following code to extract links from a page:

    
    # Double quotes are needed here so that $input_file is interpolated into the message
    $input = @file_get_contents($input_file) or die("Could not access file: $input_file");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $input, $matches))
    {
        # $matches[2] = array of link addresses
        # $matches[3] = array of link text - including HTML code
    }
    Everything works fine, except that non-ASCII characters are not copied correctly. For example, the following link
    <a href="get.php?d=07/10/24/w/p_tkytx">cÖ_g cvZv</a>
    is extracted as
    <a href="get.php?d=07/10/24/w/p_tkytx">c�_g cvZv</a>
    although I see a ? instead of � here. I am sure this is a Unicode character encoding problem. The page I am working on is in a foreign language.
    Can someone help me copy the exact code as shown in the second code box?
    reza1217, Oct 24, 2007 IP
  2. nabil_kadimi

    #2
    Maybe you could try the mb_ereg() function, though I'm not sure it will help.
     
    nabil_kadimi, Oct 24, 2007 IP
  3. nico_swd

    #3
    Try using the UTF-8 modifier, the lowercase u.
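A minimal sketch of what the lowercase u modifier does, not a definitive fix for the page in question. The two sample strings are hypothetical: they hold the same anchor tag with the Ö stored once as UTF-8 bytes and once as a single windows-1252 byte. With u, PCRE treats both the pattern and the subject as UTF-8, so a subject in another encoding makes the match fail instead of passing the bytes through.

```php
<?php
// Hypothetical sample data: the same link in two different encodings.
$utf8   = "<a href=\"get.php?d=07/10/24/w/p_tkytx\">c\xC3\x96_g cvZv</a>"; // Ö as UTF-8 (2 bytes)
$cp1252 = "<a href=\"get.php?d=07/10/24/w/p_tkytx\">c\xD6_g cvZv</a>";     // Ö as windows-1252 (1 byte)

// Same expression as in the first post, with the u modifier appended.
$pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siUu';

// Valid UTF-8 input: the link is matched and the text is kept intact.
$okUtf8 = preg_match_all($pattern, $utf8, $m1);

// Non-UTF-8 input: the call fails rather than returning damaged text.
$okCp = preg_match_all($pattern, $cp1252, $m2);
$err  = preg_last_error();

if ($okCp === false && $err === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8; check the page's real encoding.\n";
}
```

So the u modifier only helps when the input really is UTF-8. If preg_last_error() reports PREG_BAD_UTF8_ERROR, the page is probably in some other charset.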
     
    nico_swd, Oct 24, 2007 IP
  4. reza1217

    #4
    I figured out the problem: it was caused by the charset declared on the page that displays the results. I had to use
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    instead of
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    Thank you for your help.
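An alternative to changing the charset of the output page, sketched under two assumptions: the fetched page really is windows-1252, and the mbstring extension is available. The idea is to convert the input to UTF-8 before running the regular expression, so the output page can stay on charset=utf-8. The sample string is hypothetical.

```php
<?php
// Hypothetical sample: link text as it would arrive from a windows-1252 page.
$raw = "c\xD6_g cvZv"; // Ö stored as the single byte 0xD6

// Convert from the assumed source encoding to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

// Sanity check: the result should now be valid UTF-8.
$valid = mb_check_encoding($utf8, 'UTF-8');

echo $utf8, "\n";
```

After this conversion the text can be matched with the u modifier and printed on a UTF-8 page. If the source page uses a different charset, the third argument has to name that charset instead.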
     
    reza1217, Oct 24, 2007 IP