Regex to extract all links from a webpage

amorph Peon

Messages:: 200

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#1

Any idea why this code, which I found in one of nico's posts, fails to extract all the URLs in a given string?
function get_all_links ( $string )
	{
		if ( preg_match_all( '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i', $string, $links ) )
		{		
			return array_unique( $links[1] );
		}
		else {
			return false;
		}
	}
PHP:
I have also attached a file to run a test. It prints 30 links instead of the 80 or so it should.

Attached Files: index.php

amorph, Jun 30, 2007 IP

nico_swd Prominent Member

Messages:: 4,153

Likes Received:: 344

Best Answers:: 18

Trophy Points:: 375

#2

Try adding the s modifier to the pattern.
'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
PHP:
And if that's my code, someone modified it. I'm not trying to use that as an excuse, lol. But it's not my writing style.
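
A minimal sketch of why the s modifier matters here: without it, the dot does not match newlines, so any link whose text spans more than one line is skipped. The sample markup below is made up for illustration.

```php
<?php
// Hypothetical sample markup: the second link spans several lines.
$html = "<a href=\"one.html\">One</a>\n"
      . "<a href=\"two.html\">\n  Two\n</a>";

$without_s = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i';
$with_s    = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si';

preg_match_all($without_s, $html, $a);
preg_match_all($with_s, $html, $b);

print_r($a[1]); // only one.html: the dot cannot cross the newlines in the second link
print_r($b[1]); // one.html and two.html
```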

nico_swd, Jun 30, 2007 IP

amorph Peon

#3

No no no... only the regex is yours. Everything else is my writing style. Thank you, that worked.

amorph, Jun 30, 2007 IP

nico_swd Prominent Member

#4

Hehe, okay okay, I was just wondering.

Btw, if you want to exclude anchors (hrefs starting with #) and javascript: links, you can use this pattern:
'/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si'
PHP:
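
A quick sketch of what the negative lookahead in that pattern filters out. The sample links are made up for illustration. Note that only hrefs that start with # are skipped, so a URL with a fragment at the end is still returned.

```php
<?php
// Hypothetical sample markup.
$html = '<a href="#top">Top</a>'
      . '<a href="javascript:void(0)">Popup</a>'
      . '<a href="page.html#section">Page</a>'
      . '<a class="x" href="http://example.com/">Example</a>';

$pattern = '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si';

preg_match_all($pattern, $html, $links);

print_r($links[1]); // page.html#section and http://example.com/
```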

nico_swd, Jun 30, 2007 IP

amorph Peon

#5

Thanks, nico. I know I've pestered this whole forum with my regexes, but how would you write one if you had to extract all the anchor text from a link like this? For example, let's take this string:
<a href="something.com" title="something">

something else
<em>here's my problem</em>
<span>here's another problem</span>

</a>
HTML:
The result should be:
"something else here's my problem here's another problem"

I don't know how to ignore the HTML tags and have the regex extract only the text.

Thank you.

amorph, Jun 30, 2007 IP

nico_swd Prominent Member

#6

Give this a try:


function get_all_links($string)
{
	if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links))
	{
		// Get rid of the HTML tags
		$links[2] = array_map('strip_tags', $links[2]);
		// Get rid of full pattern matches
		unset($links[0]);
		
		return $links;
	}
	
	return false;
}

PHP:
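
A short sketch of that pattern applied to the sample link from post #5. strip_tags() removes the tags but leaves the line breaks, so this sketch also collapses whitespace to get the single-line result that was asked for. That normalization step is an addition for this example and is not part of the function above.

```php
<?php
// The sample link from post #5.
$string = "<a href=\"something.com\" title=\"something\">\n\n"
        . "something else\n"
        . "<em>here's my problem</em>\n"
        . "<span>here's another problem</span>\n\n"
        . "</a>";

$pattern = '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si';

preg_match_all($pattern, $string, $links);

// Strip the tags, then collapse runs of whitespace (extra step for this sketch).
$texts = array();
foreach ($links[2] as $raw) {
    $texts[] = trim(preg_replace('/\s+/', ' ', strip_tags($raw)));
}

echo $links[1][0], "\n"; // something.com
echo $texts[0], "\n";    // something else here's my problem here's another problem
```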

nico_swd, Jun 30, 2007 IP

amorph Peon

#7

It seems you're hard to checkmate, not that I want to.

Is there a way to go further and extract the same text between the <a> and </a> tags, but only for outgoing links, or only for internal ones?

amorph, Jun 30, 2007 IP

nico_swd Prominent Member

#8

Okay, it's getting a little more complex.


function get_all_links($string, $domain = 'roscripts.com')
{
	if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links))
	{
		$domain = preg_quote($domain, '/');

		foreach (array_keys($links[1]) AS $key)
		{
			if (preg_match("/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i", $links[1][$key]))
			{
				$type = 'external';
			}
			else
			{
				$type = 'internal';
			}
			
			if (!$text = preg_replace('/\s{2,}/', ' ', strip_tags(trim($links[2][$key]), '<img>')))
			{
				$text = 'Undefined link text';
			}
			
			$links[$type]['url'][]  = $links[1][$key];
			$links[$type]['text'][] = $text;
		}
		// Clean array
		unset($links[0], $links[1], $links[2]);

		return $links;
	}
	
	return false;
}

PHP:

Usage example:


echo '<pre>';

$links = get_all_links($string_toParse, 'roscripts.com');

foreach (array_keys($links) AS $type)
{
	echo "<p><strong>{$type}</strong></p>\n";
	
	foreach (array_keys($links[$type]['url']) AS $key)
	{
		echo '<a href="'. $links[$type]['url'][$key] .'">'. $links[$type]['text'][$key] .'</a>' . "\n";
	}
}

echo '</pre>';

PHP:

This returns pretty much everything, grouped by type. Based on the example, it's quite easy to pick out only the text of the external or the internal links.
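
For instance, a minimal sketch of picking out one type. The $links array below is hand-written in the shape the function returns, and link_texts() is a hypothetical helper, not part of the code above. The isset() check covers pages that have no links of the requested type, in which case that key is never created.

```php
<?php
// Hypothetical result, in the shape returned by get_all_links() above.
$links = array(
    'internal' => array('url' => array('/about'), 'text' => array('About us')),
    'external' => array('url' => array('http://example.com/'), 'text' => array('Example')),
);

// Hypothetical helper: return the link texts for one type, or an empty array.
function link_texts(array $links, $type)
{
    return isset($links[$type]['text']) ? $links[$type]['text'] : array();
}

print_r(link_texts($links, 'external')); // Example
print_r(link_texts($links, 'missing'));  // empty array
```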

nico_swd, Jul 1, 2007 IP

amorph Peon

#9

Oh man... you're a gold mine. Don't you leave this forum :)

amorph, Jul 1, 2007 IP

amorph Peon

#10

I'm pushing my luck. It works great except for one tiny thing. A subdomain is considered external and vice versa (domain.com is external when compared with subdomain.domain.com, and the same goes for subdomain.domain.com compared with domain.com). Any workarounds?

amorph, Jul 1, 2007 IP

nico_swd Prominent Member

#11

Try replacing this:
"/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i"
PHP:
With:
"/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i"
PHP:
That would still make the www. optional. So domain.com and www.domain.com would be considered the same.

And I'm glad it works.
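
A small sketch comparing the two patterns on a few made-up URLs, assuming 'domain.com' as the domain. With the first pattern any subdomain made of word characters counts as internal. With the second only the bare domain and www. do. Relative URLs never match either pattern, so they are always internal.

```php
<?php
$domain = preg_quote('domain.com', '/');

$old = "/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i";
$new = "/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i";

// Hypothetical test URLs.
$urls = array(
    'http://domain.com/page',
    'http://www.domain.com/page',
    'http://sub.domain.com/page',
    'http://example.org/',
    '/relative/path',
);

$results = array();
foreach ($urls as $url) {
    // A match means the URL is absolute and does not point at the domain.
    $before = preg_match($old, $url) ? 'external' : 'internal';
    $after  = preg_match($new, $url) ? 'external' : 'internal';
    $results[$url] = array($before, $after);
    printf("%-28s %-9s %s\n", $url, $before, $after);
}
// Only http://sub.domain.com/page changes: internal with the first pattern, external with the second.
```

Both patterns only check that the host starts with the domain, so a host such as domain.com.example.org would also be classified as internal.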

nico_swd, Jul 1, 2007 IP

amorph Peon

#12

Yes. It captures only the external links now. Super!

amorph, Jul 1, 2007 IP

b47chguru Member

Messages:: 2

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 31

#13

Suppose I want to extract all the links in an HTML document based on their text. How can I do it?
<a href="http://www.google.com">google</a>

I want to search for all links that have the text "google" and extract them.

b47chguru, Apr 17, 2012 IP
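
One possible approach, sketched with the same kind of pattern used earlier in the thread: capture both the href and the anchor content, strip the tags from the content, and keep the URLs whose text matches. get_links_by_text() is a hypothetical helper, and the case-insensitive exact match is an assumption about what "having the text google" should mean.

```php
<?php
// Hypothetical helper: return the URLs of links whose visible text equals $text.
function get_links_by_text($string, $text)
{
    $pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)[^>]*>(.*?)<\/a>/si';
    $urls = array();

    if (preg_match_all($pattern, $string, $links, PREG_SET_ORDER)) {
        foreach ($links as $link) {
            // $link[1] is the href, $link[2] the raw anchor content.
            $label = trim(preg_replace('/\s+/', ' ', strip_tags($link[2])));

            // Case-insensitive exact comparison (an assumption for this sketch).
            if (strcasecmp($label, $text) === 0) {
                $urls[] = $link[1];
            }
        }
    }

    return array_values(array_unique($urls));
}

// Hypothetical sample markup.
$html = '<a href="http://www.google.com">google</a> '
      . '<a href="http://example.com/">Example</a> '
      . '<a href="http://www.google.com/maps"><b>Google</b></a>';

print_r(get_links_by_text($html, 'google'));
// http://www.google.com and http://www.google.com/maps
```

To match links whose text merely contains the word, the strcasecmp() check could be swapped for stripos($label, $text) !== false.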
