On-domain link matching regex

Discussion in 'PHP' started by AHA7, Jun 12, 2007.

  1. #1
    Hello,

    How can I write a regular expression that matches all on-domain links on a page but not off-domain links?

    Here's the regex that matches any link in an <a> tag (the link is captured by the parenthesized group):

    $link_regex = '#<a\b[^>]*\bhref=["]([^"]+)["][^>]*>#is';

    Now suppose I know the domain name (say it's http://www.example.com) and I want to match all relative links on the page, plus only those absolute links that are on-domain (some pages use absolute links while others use relative ones).

    How can I write the following in regex:
    If the link starts with http://www.example.com OR if it does not start with http:// then match it?

    Now if the link is on-domain and absolute, then the first part of the condition would be true, the second false, true OR false = true => a match.
    If the link is on-domain relative, then the first part of the condition would be false, the second true, false OR true = true => a match.
    If the link is off-domain, then the first part of the condition would be false, the second false (it has to start with http:// since it's off-domain), false OR false = false => no match.

    The problem is how to write that condition in regex?

    P.S. I want to do this in ONE regex.


    AHA7, Jun 12, 2007
  2. krakjoe

    #2
    I can't suss a pattern for it, however your pattern is wrong ...

    
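    <?php
    // Despite its name, this returns the links on page $from that are on-domain:
    // absolute links that start with $from, plus every relative link.
    function outbound_links( $from )
    {
    	$return = array(); // always return an array, even when nothing matches
    	// ["\'] accepts either quote character; the earlier ["|\'] also accepted a literal |
    	if( ( $data = @file_get_contents( $from ) ) and preg_match_all( '#href=["\'](.*?)["\']#is', $data, $links ) > 0 )
    	{
    		foreach( $links[1] as $link )
    		{
    			// stripos() === 0 is a case-insensitive "starts with";
    			// ereg() was removed in PHP 7, so the second test uses stripos() as well
    			if( stripos( $link, $from ) === 0
    				or stripos( $link, 'http://' ) !== 0 )
    				$return[] = trim( $link );
    		}
    	}
    	return $return ;
    }
    $data = outbound_links( 'http://www.digitalpoint.com' );
    if( is_array( $data ) )
    {
    	foreach( $data  as $count => $link )
    	{
    		printf("Link #%d : %s<br />\n", $count, $link );
    	}
    }
    ?>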
    
    PHP:
    That works and won't be much heavier on resources than a massively complicated pattern (assuming the pattern is possible; I did try).
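For comparison, the condition from post #1 ("starts with the domain OR does not start with http://") can probably be expressed in one pattern by putting an alternation inside the capture group and using a negative lookahead for the second branch. The following is only a sketch under some assumptions: the function name `on_domain_links` and its parameters are hypothetical, hrefs are assumed to be double-quoted as in the original `$link_regex`, and `https://` is treated as absolute too, which goes slightly beyond what was asked.

```php
<?php
// Hypothetical helper (not from the thread): returns the href values of
// on-domain links found in $html. $domain is the site root, for example
// 'http://www.example.com'.
function on_domain_links(string $html, string $domain): array
{
    // Escape regex metacharacters in the domain; '~' is the delimiter used below.
    $root = preg_quote($domain, '~');

    // Inside the capture group, one of two branches must succeed:
    //   1. the href starts with the domain, and the next character is / ? # or
    //      the closing quote (so http://www.example.com.evil.test is rejected), or
    //   2. (?!https?://) asserts that the href does not start with a scheme,
    //      which is how a relative link is recognised here.
    $pattern = '~<a\b[^>]*\bhref="((?:' . $root . '(?=[/?#"])|(?!https?://))[^"]*)"[^>]*>~is';

    preg_match_all($pattern, $html, $matches);

    // $matches[1] holds the captured href of every matching <a> tag.
    return $matches[1];
}
```

Calling `on_domain_links($html, 'http://www.example.com')` on a page's HTML should return the absolute on-domain hrefs and the relative ones in document order. Note that this sketch still counts protocol-relative (`//host/path`), `mailto:` and `javascript:` hrefs as "relative", so the two-step approach above may be easier to extend.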
     
    krakjoe, Jun 12, 2007