Hi,

Let's say I have an HTML file containing A tags like this:

    <a *anything here* href="http://www.url.com" *anything here*>*anything here*</a>
    <a *anything here* href="folder/somepage.html" *anything here*>*anything here*</a>
    <a *anything here* href="../somepage.html" *anything here*>*anything here*</a>
    <a *anything here* href="somepage.php?test=43" *anything here*>*anything here*</a>

My question is: how can I use regular expressions to get the URL string from the text above? I would like to be able to produce a list of URLs like this (including the domain etc.):

    http://www.url.com
    http://www.url.com/folder/somepage.html
    http://www.url.com/somepage.html
    http://www.url.com/somepage.php?test=43

Is this easy to do in PHP using regular expressions? Also, is there a way to grab the anchor text for each link as well?

Cheers ...
Gerald.

    $reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
    preg_match_all($reg, $file_contents, $out);

Just look at what you have in $out after this. It should be all you need.

Hi,

I did the following as a test:

    <?php
    $file_contents = '
    <a *anything here* href="http://www.url.com" *any1*>*here1*</a>
    <a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
    <a *anything here* href="../somepage.html" *any3*>*here3*</a>
    <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';

    $reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
    preg_match_all($reg, $file_contents, $out);

    foreach ($out as $val) {
        echo "part 1: " . $val[0] . " <br>\n";
        echo "part 2: " . $val[1] . "<br>\n";
        echo "part 3: " . $val[3] . "<br>\n";
        echo "part 4: " . $val[4] . "<br><br>\n\n";
    }

But the output I get is this:

    part 1: <a *anything here* href="http://www.url.com" *any1*>*here1*</a> <br>
    part 2: <a *anything here* href="folder/somepage.html" *any2*>*here2*</a><br>
    part 3: <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a><br>
    part 4: <br><br>

    part 1: *anything here* <br>
    part 2: *anything here* <br>
    part 3: *anything here* <br>
    part 4: <br><br>

    part 1: http://www.url.com <br>
    part 2: folder/somepage.html<br>
    part 3: somepage.php?test=43<br>
    part 4: <br><br>

    part 1: *any1* <br>
    part 2: *any2*<br>
    part 3: *any4*<br>
    part 4: <br><br>

    part 1: *here1* <br>
    part 2: *here2*<br>
    part 3: *here4*<br>
    part 4: <br><br>

For some reason it's not picking up the third one:

    <a *anything here* href="../somepage.html" *any3*>*here3*</a>

Any ideas?

Regards,
Gerald.

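A likely explanation, offered as a sketch because the thread itself does not spell it out: by default preg_match_all() fills $out in PREG_PATTERN_ORDER, so $out[0] holds every full match, $out[1] every first group, and so on. The loop in the test therefore walks over capture groups rather than links, and since it echoes indices 0, 1, 3 and 4 of each group, index 2 (the third link) is never printed. Passing PREG_SET_ORDER, a standard flag of preg_match_all(), yields one entry per link instead. The variable name $links is made up for this example.

```php
<?php
$file_contents = '
<a *anything here* href="http://www.url.com" *any1*>*here1*</a>
<a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
<a *anything here* href="../somepage.html" *any3*>*here3*</a>
<a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';

$reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';

// PREG_SET_ORDER groups the results per match:
// $links[0] is the first link, $links[1] the second, and so on.
preg_match_all($reg, $file_contents, $links, PREG_SET_ORDER);

foreach ($links as $link) {
    // $link[0] = whole tag, $link[2] = href value, $link[4] = anchor text
    echo "url: " . $link[2] . " | anchor: " . $link[4] . "<br>\n";
}
```

With this layout each pass through the loop deals with exactly one link, so nothing depends on remembering which numeric index belongs to which match.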
    preg_match_all('/<a.+?href="([^"]+)"[^>]*>.+?<\/a>/si', $text, $urls);
    echo '<pre>' . print_r($urls[1], true) . '</pre>';

This works for me.

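One thing worth checking with patterns of this kind is quantifier greediness once several links sit in the same string. The sketch below compares a greedy `.+` with a lazy `.+?` under the `s` modifier. The two-link sample in $text and the names $greedy and $lazy are made up for illustration.

```php
<?php
$text = '<a href="one.html">first</a> <a href="two.html">second</a>';

// Greedy: ".+" consumes as much as it can, so a single match runs from the
// first <a to the last </a>, and the capture is the last href in the string.
preg_match_all('/<a.+href="([^"]+)"[^>]*>.+<\/a>/si', $text, $greedy);

// Lazy: ".+?" stops at the first point where the rest of the pattern fits,
// so every link produces its own match.
preg_match_all('/<a.+?href="([^"]+)"[^>]*>.+?<\/a>/si', $text, $lazy);

print_r($greedy[1]); // one entry: two.html
print_r($lazy[1]);   // two entries: one.html, two.html
```

The `U` modifier used earlier in the thread has a similar effect, because it makes every quantifier in the pattern lazy by default.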
Thanks people for all your help!!!

I've had a play and this does what I want:

    <?php
    $file_contents = '
    <a *anything here* href="http://www.url.com" *any1*>*here1*</a>
    <a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
    <a *anything here* href="../somepage.html" *any3*>*here3*</a>
    <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';

    $reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
    preg_match_all($reg, $file_contents, $out);

    $result = count($out[0]);
    echo 'Count: ' . $result . '<br><br>';

    echo '<strong>URLs:</strong><br>';
    foreach ($out[2] as $val) {
        echo '<br>' . $val;
    }

    echo '<br><br><strong>Anchors:</strong><br>';
    foreach ($out[4] as $val) {
        echo '<br>' . $val;
    }
    ?>

Output:

    Count: 4

    URLs:
    http://www.url.com
    folder/somepage.html
    ../somepage.html
    somepage.php?test=43

    Anchors:
    *here1*
    *here2*
    *here3*
    *here4*

Thanks for all your help !!!!!! Green Rep for all of you!

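The URLs in the output above are still in the form they had in the href attributes, whereas the original question asked for full URLs including the domain. The thread does not cover that step, so what follows is only a minimal sketch. resolve_url() is a hypothetical helper, and $base (the address of the page being parsed) is an assumption. It relies on PHP 7 or later and handles only absolute URLs, root-relative paths, and "./" or "../" segments. Fragments, protocol-relative links, empty hrefs and trailing slashes are not dealt with.

```php
<?php
// Hypothetical helper, not from the thread: resolve $href against the URL
// of the page it was found on.
function resolve_url(string $base, string $href): string
{
    // Already absolute (starts with a scheme such as http://): keep as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Split off the query string so it is not treated as a path segment.
    $query = '';
    $qpos  = strpos($href, '?');
    if ($qpos !== false) {
        $query = substr($href, $qpos);
        $href  = substr($href, 0, $qpos);
    }

    if (substr($href, 0, 1) === '/') {
        $path = $href; // root-relative link
    } else {
        // Directory of the base page: everything up to its last slash.
        $basePath = $parts['path'] ?? '/';
        $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path     = $dir . $href;
    }

    // Collapse "." and ".." segments; a ".." above the root is ignored.
    $segments = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;
        }
        if ($seg === '..') {
            array_pop($segments);
            continue;
        }
        $segments[] = $seg;
    }

    return $root . '/' . implode('/', $segments) . $query;
}

$base  = 'http://www.url.com/'; // assumed location of the parsed page
$hrefs = ['http://www.url.com', 'folder/somepage.html', '../somepage.html', 'somepage.php?test=43'];
foreach ($hrefs as $href) {
    echo resolve_url($base, $href) . "\n";
}
```

With the assumed base, this prints the four URLs listed in the original question. For pages in a subfolder, $base would need to be the full address of that page so that "../" is resolved relative to the right directory.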
You can download the Windows Script 5.6 Documentation from Microsoft to learn regular expressions:

    http://www.microsoft.com/downloads/details.aspx?familyid=01592C48-207D-4BE1-8A76-1C4099D7BBB9&displaylang=en