Having trouble using regexp to find links in a page

Discussion in 'PHP' started by reza1217, Oct 24, 2007.

reza1217 Peon

#1
Hi everyone, I would appreciate it if someone could help me with the following problem. I am trying to use the following code to extract links from a page:

$input = @file_get_contents($input_file) or die("Could not access file: $input_file");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    # $matches[2] = array of link addresses
    # $matches[3] = array of link text - including HTML code
}

Everything works fine, except that non-ASCII characters are not copied correctly. For example, the following link

<a href="get.php?d=07/10/24/w/p_tkytx">cÃ–_g cvZv</a>

is identified as

<a href="get.php?d=07/10/24/w/p_tkytx">cï¿½_g cvZv</a>

although I see a ? instead of ï¿½ here. I am sure this is a Unicode character encoding problem. The page I am working on is in a foreign language.
Can someone help me copy the exact code as shown in the second code box?
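
For context, the ï¿½ sequence is what the three UTF-8 bytes of the Unicode replacement character U+FFFD (EF BF BD) look like when they are read as Windows-1252. A minimal sketch that reproduces the symptom, assuming the iconv extension is available (the variable names are only illustrative):

```php
<?php
// UTF-8 encoding of U+FFFD, the replacement character shown for undecodable bytes.
$replacement = "\xEF\xBF\xBD";

// Reinterpret those three bytes as Windows-1252 and re-encode the result as UTF-8.
$misread = iconv('Windows-1252', 'UTF-8', $replacement);

echo $misread, "\n"; // shows ï¿½ on a UTF-8 terminal
```

Seeing ï¿½ therefore suggests that the original byte was already replaced somewhere along the way and that the result was then displayed under a different charset.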
reza1217, Oct 24, 2007
nabil_kadimi Well-Known Member

#2

Maybe you can try the mb_ereg() function; I'm not sure.

nabil_kadimi, Oct 24, 2007
nico_swd Prominent Member

#3

Try using the UTF-8 modifier, the lowercase u.
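
A minimal sketch of that suggestion, assuming the fetched page really is UTF-8. The sample `$input` string is hypothetical, and the pattern is the one from the first post written in single quotes. With `u`, PCRE treats both pattern and subject as UTF-8; if the subject is not valid UTF-8, `preg_match_all()` returns false instead of matching.

```php
<?php
// Hypothetical sample input: a UTF-8 link whose text contains "Ö" (bytes C3 96).
$input = '<a href="get.php?d=07/10/24/w/p_tkytx">c' . "\xC3\x96" . '_g cvZv</a>';

// Same pattern as in the first post; single quotes avoid double escaping.
$regexp = '<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>';

// s = dot matches newlines, i = case-insensitive, U = ungreedy, u = UTF-8 mode.
$count = preg_match_all("/$regexp/siUu", $input, $matches);

if ($count === false) {
    // With /u this usually means the subject is not valid UTF-8.
    echo 'Regex error code: ', preg_last_error(), "\n";
} elseif ($count > 0) {
    echo $matches[2][0], "\n"; // link address
    echo $matches[3][0], "\n"; // link text, multibyte character intact
}
```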

nico_swd, Oct 24, 2007
reza1217 Peon

#4
I figured out the problem: the charset was the cause. I had to use

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

instead of

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Thank you for your help.
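
An alternative to changing the output page's charset would be to convert the fetched bytes to UTF-8 before matching, so the page can keep charset=utf-8. This is only a sketch under stated assumptions: the sample `$input` is hypothetical, the source page is assumed to be Windows-1252 whenever it is not valid UTF-8, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical sample: bytes as fetched from a Windows-1252 page (0xD6 is "Ö" there).
$input = '<a href="get.php?d=07/10/24/w/p_tkytx">c' . "\xD6" . '_g cvZv</a>';

// preg_match('//u', ...) returns 1 only when the subject is valid UTF-8.
if (preg_match('//u', $input) !== 1) {
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    $input = iconv('Windows-1252', 'UTF-8', $input);
}

$regexp = '<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>';

$linkText = null;
if (preg_match_all("/$regexp/siUu", $input, $matches)) {
    $linkText = $matches[3][0]; // now UTF-8, suitable for a charset=utf-8 page
}
```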
reza1217, Oct 24, 2007