Regex to extract all links from a webpage

Discussion in 'PHP' started by amorph, Jun 30, 2007.

  1. #1
    Any idea why this code, which I found in one of nico's posts, fails to extract all the URLs from a given string?
    
    function get_all_links ( $string )
    	{
    		if ( preg_match_all( '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i', $string, $links ) )
    		{		
    			return array_unique( $links[1] );
    		}
    		else {
    			return false;
    		}
    	}
    PHP:
    I have also attached a file to run a test. It prints 30 links instead of 80 or so as it should.
     


    amorph, Jun 30, 2007
  2. nico_swd

    nico_swd Prominent Member

    Messages:
    4,153
    Likes Received:
    344
    Best Answers:
    18
    Trophy Points:
    375
    #2
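    Try adding the s modifier to the pattern. It makes the dot match newlines too, so links whose markup spans more than one line are no longer skipped.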

    
    '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
    
    PHP:
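    To see why the modifier matters, here is a minimal sketch. The sample string is made up for illustration (it is not the attached test file); its second link has anchor text that spans two lines.

```php
<?php
// Hypothetical sample: the second anchor's text contains a line break.
$html = "<a href=\"one.html\">One</a>\n<a href=\"two.html\">Two\nlines</a>";

$without_s = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i';
$with_s    = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si';

preg_match_all($without_s, $html, $m1);
preg_match_all($with_s, $html, $m2);

// Without "s" the dot stops at the newline, so the second link is missed.
print_r($m1[1]);
// With "s" the dot also matches newlines, so both links are captured.
print_r($m2[1]);
```

    Without the modifier only one.html is captured; with it, both URLs are.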

    And if that's my code, someone modified it. I'm not trying to use that as an excuse, lol. But it's not my writing style.
     
    nico_swd, Jun 30, 2007
  3. amorph

    amorph Peon

    Messages:
    200
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #3
    No no no... that's just your regex. Everything else is my writing style :) Thank you. That worked.
     
    amorph, Jun 30, 2007
  4. nico_swd

    #4
    Hehe, okay okay, I was just wondering. :)

    Btw, if you want to exclude anchor links (href="#...") and javascript: links, you can use this pattern:
    
    '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si'
    
    PHP:
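    A quick way to check what the negative lookahead filters out is sketched below. The three sample links are made up for illustration.

```php
<?php
// Hypothetical sample: an in-page anchor, a javascript: link and a normal link.
$html = '<a href="#top">Top</a>'
      . '<a href="javascript:void(0)">Click</a>'
      . '<a class="x" href="page.html">Page</a>';

$pattern = '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si';

preg_match_all($pattern, $html, $m);

// (?!(?:#|javascript\s*:)) rejects any href value that starts with "#"
// or "javascript:", so only the ordinary link is left.
print_r($m[1]);
```

    Only page.html is captured here.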
     
    nico_swd, Jun 30, 2007
  5. amorph

    #5
    Thanks, nico. I know I've stressed this whole forum with my regexes, but how would you write one to extract all the anchor text from a link like this? For example, take this string:

    <a href="something.com" title="something">
    
    something else
    <em>here's my problem</em>
    <span>here's another problem</span>
    
    </a>
    HTML:
    The result should be:
    "something else here's my problem here's another problem"

    I don't know how to ignore the HTML tags and have the regex extract only the text.

    Thank you.
     
    amorph, Jun 30, 2007
  6. nico_swd

    #6
    Give this a try:
    
    function get_all_links($string)
    {
    	if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links))
    	{
    		// Get rid of the HTML tags
    		$links[2] = array_map('strip_tags', $links[2]);
    		// Get rid of full pattern matches
    		unset($links[0]);
    		
    		return $links;
    	}
    	
    	return false;
    }	
    
    
    PHP:
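    As a rough check against the sample link from post #5, the same pattern can be run inline as sketched below. The whitespace clean-up step is an addition for illustration: strip_tags() removes the tags but leaves the line breaks between them, so the sketch collapses them to get the single-line result that was asked for.

```php
<?php
// The sample markup from post #5.
$html = <<<HTML
<a href="something.com" title="something">

something else
<em>here's my problem</em>
<span>here's another problem</span>

</a>
HTML;

$pattern = '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si';
preg_match_all($pattern, $html, $m);

// Remove the inner tags, then collapse runs of whitespace (including the
// leftover newlines) into single spaces and trim both ends.
$texts = array_map(function ($inner) {
    return trim(preg_replace('/\s+/', ' ', strip_tags($inner)));
}, $m[2]);

print_r($m[1]);   // captured URLs
print_r($texts);  // cleaned anchor texts
```

    For this input the URL list holds something.com and the text comes out as "something else here's my problem here's another problem".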
     
    nico_swd, Jun 30, 2007
  7. amorph

    #7
    It seems you're hard to checkmate :) Not that I want to.

    Is there a way to go further and extract the text between the <a> and </a> tags, but only for outgoing links, or only for internal ones?
     
    amorph, Jun 30, 2007
  8. nico_swd

    #8
    Okay, it's getting a little more complex.
    
    function get_all_links($string, $domain = 'roscripts.com')
    {
    	if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links))
    	{
    		$domain = preg_quote($domain, '/');
    
    		foreach (array_keys($links[1]) AS $key)
    		{
    			if (preg_match("/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i", $links[1][$key]))
    			{
    				$type = 'external';
    			}
    			else
    			{
    				$type = 'internal';
    			}
    			
    			if (!$text = preg_replace('/\s{2,}/', ' ', strip_tags(trim($links[2][$key]), '<img>')))
    			{
    				$text = 'Undefined link text';
    			}
    			
    			$links[$type]['url'][]  = $links[1][$key];
    			$links[$type]['text'][] = $text;
    		}
    		// Clean array
    		unset($links[0], $links[1], $links[2]);
    
    		return $links;
    	}
    	
    	return false;
    }
    
    
    PHP:
    Usage example:
    
    echo '<pre>';
    
    $links = get_all_links($string_toParse, 'roscripts.com');
    
    foreach (array_keys($links) AS $type)
    {
    	echo "<p><strong>{$type}</strong></p>\n";
    	
    	foreach (array_keys($links[$type]['url']) AS $key)
    	{
    		echo '<a href="'. $links[$type]['url'][$key] .'">'. $links[$type]['text'][$key] .'</a>' . "\n";
    	}
    }
    
    echo '</pre>';
    
    PHP:
    This gets pretty much everything. Based on the example, it's easy to get only the text of the external or the internal links: read $links['external']['text'] or $links['internal']['text'].
     
    nico_swd, Jul 1, 2007
  9. amorph

    #9
    Oh man... you're a gold mine. Don't you leave this forum :))
     
    amorph, Jul 1, 2007
  10. amorph

    #10
    I'm pushing my luck. It works great except for one tiny thing: a subdomain is considered external, and vice versa. (domain.com is external when compared with subdomain.domain.com, and subdomain.domain.com is external when compared with domain.com.) Any workarounds? :)
     
    amorph, Jul 1, 2007
  11. nico_swd

    #11
    Try replacing this:
    
    "/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i"
    
    PHP:
    With:
    
    "/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i"
    
    PHP:
    That would still make the www. optional. So domain.com and www.domain.com would be considered the same.
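    To try the replacement pattern on its own, something like the sketch below could be used. The link_type() helper and the sample URLs are hypothetical and only serve to show how individual URLs get classified; they are not part of the function above.

```php
<?php
// Hypothetical helper for illustration: classifies one URL with the
// replacement pattern, where only an optional "www." prefix is allowed.
function link_type($url, $domain)
{
    $domain  = preg_quote($domain, '/');
    $pattern = "/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i";

    // A match means: absolute URL whose host does not start with the domain.
    return preg_match($pattern, $url) ? 'external' : 'internal';
}

$urls = array(
    'http://domain.com/page',      // internal
    'http://www.domain.com/page',  // internal ("www." is optional)
    'http://sub.domain.com/page',  // external (other subdomains no longer pass)
    'https://example.org/',        // external
    '/relative/path.html',         // internal (no scheme, so the pattern never matches)
);

foreach ($urls as $url) {
    echo $url, ' => ', link_type($url, 'domain.com'), "\n";
}
```

    Note that relative URLs always end up as internal, because the pattern only ever matches absolute http, https, ftp or ftps URLs.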

    And I'm glad it works. :)
     
    nico_swd, Jul 1, 2007
  12. amorph

    #12
    Yes. It captured only external links now. Super!
     
    amorph, Jul 1, 2007
  13. b47chguru

    b47chguru Member

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    31
    #13
    Suppose I want to extract all the links in an HTML page based on their text. How can I do it?
    <a href="http://www.google.com">google</a>

    I want to search for all links that have the text "google" and extract them.
     
    b47chguru, Apr 17, 2012