Hello,

How can I write a regular expression that matches all on-domain links on a page but not off-domain links?

Here's the regex that matches any link in an <a> tag (the link is captured by the parenthesized group):

    $link_regex = '#<a\b[^>]*\bhref=["]([^"]+)["][^>]*>#is';

Now suppose I know the domain name (say it's http://www.example.com) and I want to match all relative links on the page, plus only those absolute links that are on-domain (some pages use absolute links while others use relative ones). How can I write the following in regex: if the link starts with http://www.example.com OR it does not start with http://, then match it?

If the link is on-domain and absolute, the first part of the condition is true and the second is false: true OR false = true => a match.
If the link is on-domain and relative, the first part is false and the second is true: false OR true = true => a match.
If the link is off-domain, the first part is false and so is the second (an off-domain link has to start with http://): false OR false = false => no match.

The problem is how to write that condition in regex.

P.S. I want to do this in ONE regex.
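I can't suss a pattern for it, however your pattern is wrong: it misses href values in single quotes ...

<?php
// Despite the name, this returns the links on page $from that are
// either relative or start with $from itself (i.e. the on-domain ones).
function outbound_links( $from )
{
    $return = array(); // initialise so the function always returns an array
    if( ( $data = @file_get_contents( $from ) ) and
        preg_match_all( '#href=["\'](.*?)["\']#is', $data, $links ) > 0 )
    {
        foreach( $links[1] as $link )
        {
            // keep the link if it starts with $from (case-insensitive)
            // or if it does not start with http://
            // (stripos() replaces ereg(), which was removed in PHP 7)
            if( ( substr( strtolower( $link ), 0, strlen( $from ) ) == strtolower( $from ) ) or
                stripos( $link, 'http://' ) !== 0 )
                $return[] = trim( $link );
        }
    }
    return $return;
}

$data = outbound_links( 'http://www.digitalpoint.com' );
foreach( $data as $count => $link )
{
    printf( "Link #%d : %s<br />\n", $count, $link );
}
?>

That works and won't be much heavier on resources than a massively complicated pattern (assuming such a pattern is possible; I did try).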
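For what it's worth, the OR condition from the question can probably be sketched in one pattern by putting an alternation at the start of the capture group: either the on-domain prefix, or a negative lookahead that rejects anything starting with a scheme. Treat this as a rough sketch rather than a tested solution. The function name `on_domain_links` and its `$host` parameter are made up for illustration, and matching `https://` as well as `http://` is my own assumption, not something the question asked for.

```php
<?php
// Hypothetical helper: returns the href values of <a> tags that are
// relative, or absolute and on the given host.
function on_domain_links(string $html, string $host): array
{
    $pattern = '#<a\b[^>]*\bhref=(["\'])('       // group 1: the opening quote
             . '(?:https?://' . preg_quote($host, '#')
             . '(?![\w.-])'                      // host must end here, so www.example.com.evil.org fails
             . '|(?!https?://))'                 // OR: the link does not start with a scheme
             . '[^"\']*'                         // rest of the link
             . ')\1[^>]*>#is';                   // same quote closes the attribute

    preg_match_all($pattern, $html, $matches);
    return $matches[2];                          // group 2: the link itself
}

$html = '<a href="http://www.example.com/about">About</a>'
      . "<a class='nav' href='/contact'>Contact</a>"
      . '<a href="page.html">Page</a>'
      . '<a href="http://other.example.org/">Elsewhere</a>'
      . '<a href="http://www.example.com.evil.org/">Lookalike</a>';

print_r(on_domain_links($html, 'www.example.com'));
// Array ( [0] => http://www.example.com/about [1] => /contact [2] => page.html )
```

Things like protocol-relative links (//other.example.org/) and mailto: links would still count as "relative" here, so the loop above may be easier to adjust if those matter.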