Any idea why this code, which I found in one of nico's posts, fails to extract all the URLs from a given string?
PHP:
    function get_all_links($string)
    {
        if (preg_match_all('/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i', $string, $links)) {
            return array_unique($links[1]);
        } else {
            return false;
        }
    }
I have also attached a file to run a test. It prints 30 links instead of the 80 or so that it should.
Try adding the s modifier to the pattern, so that the dot also matches newlines:
PHP:
    '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
And if that's my code, someone modified it. I'm not trying to use that as an excuse, lol. But it's not my writing style.
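To illustrate why the s modifier matters here, a minimal sketch. The sample markup and variable names are made up for the example; the two patterns are the ones from the posts above.

```php
<?php
// Hypothetical sample: the anchor's attributes and text span several lines.
$html = "<a\n  href=\"http://example.com/\">\n  Example\n</a>";

$without = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i';
$with    = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si';

// Without "s", "." does not match a newline, so the multi-line anchor is missed.
var_dump(preg_match($without, $html));   // int(0)

// With "s", "." also matches newlines and the anchor is found.
var_dump(preg_match($with, $html, $m));  // int(1)
echo $m[1], "\n";                        // http://example.com/
```

If the test file contains anchors broken across lines, that would explain a good part of the missing links.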
Hehe, okay okay, I was just wondering. Btw, if you want to exclude anchors and javascript: links, you can use this pattern:
PHP:
    '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si'
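A quick sketch of what the negative lookahead in that pattern does. The three sample links are invented for the example; only the pattern comes from the post.

```php
<?php
$pattern = '/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>.*?<\/a>/si';

// Hypothetical input: an in-page anchor, a javascript: link, and a normal link.
$html = '<a href="#top">Top</a> '
      . '<a href="javascript:void(0)">Popup</a> '
      . '<a href="http://example.com/page">Page</a>';

// The lookahead rejects any href value starting with "#" or "javascript:",
// so only the last link ends up in capture group 1.
preg_match_all($pattern, $html, $matches);
print_r($matches[1]);
```

Only http://example.com/page should be listed.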
Thanks, nico. I know I've flooded this forum with my regex questions, but how would you code one to extract the anchor text from such a link? For example, let's take this string:
HTML:
    <a href="something.com" title="something">
        something else
        <em>here's my problem</em>
        <span>here's another problem</span>
    </a>
The result should be: "something else here's my problem here's another problem". I don't know how to ignore the HTML tags and have the regex extract only the text. Thank you.
Give this a try:
PHP:
    function get_all_links($string)
    {
        if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links)) {
            // Get rid of the HTML tags
            $links[2] = array_map('strip_tags', $links[2]);

            // Get rid of full pattern matches
            unset($links[0]);

            // $links[1] holds the URLs, $links[2] the link texts
            return $links;
        }

        return false;
    }
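As a rough check against the sample string from the question, here is a small sketch. The whitespace collapsing with preg_replace and trim is an addition of mine, not part of the function above, and the lookahead is left out to keep the pattern short.

```php
<?php
// The sample link from the question, joined onto one line.
$html = '<a href="something.com" title="something"> something else '
      . "<em>here's my problem</em> <span>here's another problem</span> </a>";

// Same idea as above: group 1 is the URL, group 2 the raw inner HTML.
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)[^>]*>(.*?)<\/a>/si';

$text = null;
if (preg_match($pattern, $html, $m)) {
    // strip_tags() removes <em>, <span>, etc.; then collapse runs of whitespace.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($m[2])));
    echo $text, "\n";
}
```

This should print: something else here's my problem here's another problem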
It seems you're hard to checkmate, not that I want to. Is there a way to go further and extract the same text between the <a> </a> tags, but only for outgoing links, or only for internal ones?
Okay, it's getting a little more complex.
PHP:
    function get_all_links($string, $domain = 'roscripts.com')
    {
        if (preg_match_all('/<a[^>]+href\s*=\s*["\'](?!(?:#|javascript\s*:))([^"\']+)[^>]*>(.*?)<\/a>/si', $string, $links)) {
            $domain = preg_quote($domain, '/');

            foreach (array_keys($links[1]) AS $key) {
                // Absolute URLs not pointing at $domain (or a one-level subdomain of it) are external
                if (preg_match("/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i", $links[1][$key])) {
                    $type = 'external';
                } else {
                    $type = 'internal';
                }

                // Strip tags (keeping images) and collapse repeated whitespace
                if (!$text = preg_replace('/\s{2,}/', ' ', strip_tags(trim($links[2][$key]), '<img>'))) {
                    $text = 'Undefined link text';
                }

                $links[$type]['url'][] = $links[1][$key];
                $links[$type]['text'][] = $text;
            }

            // Clean array
            unset($links[0], $links[1], $links[2]);

            return $links;
        }

        return false;
    }
Usage example:
PHP:
    echo '<pre>';
    $links = get_all_links($string_toParse, 'roscripts.com');

    foreach (array_keys($links) AS $type) {
        echo "<p><strong>{$type}</strong></p>\n";

        foreach (array_keys($links[$type]['url']) AS $key) {
            echo '<a href="'. $links[$type]['url'][$key] .'">'. $links[$type]['text'][$key] .'</a>' . "\n";
        }
    }
    echo '</pre>';
This gets pretty much everything. But it's quite easy to get only the text of external or internal links, based on the example.
I'm pushing my luck. It works great except for one tiny thing: a subdomain is considered external and vice versa (domain.com is external when compared with subdomain.domain.com, and the same goes for subdomain.domain.com compared with domain.com). Any workarounds?
Try replacing this:
PHP:
    "/^(ht|f)tps?:\/\/(?!({$domain}|(\w+\.)?{$domain}))/i"
With:
PHP:
    "/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i"
That would still make the www. optional. So domain.com and www.domain.com would be considered the same. And I'm glad it works.
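To see how the replacement pattern classifies a few URLs, here is a small sketch. The helper name is_external_url and the sample URLs are hypothetical, made up only to exercise the pattern from the post.

```php
<?php
// Hypothetical helper wrapping the replacement pattern from the post.
function is_external_url($url, $domain)
{
    $domain = preg_quote($domain, '/');

    // External = absolute http(s)/ftp(s) URL whose host does not start
    // with $domain or www.$domain.
    return (bool) preg_match("/^(ht|f)tps?:\/\/(?!((www\.)?{$domain}))/i", $url);
}

var_dump(is_external_url('http://domain.com/page', 'domain.com'));   // bool(false)
var_dump(is_external_url('http://www.domain.com/', 'domain.com'));   // bool(false)
var_dump(is_external_url('http://sub.domain.com/', 'domain.com'));   // bool(true)
var_dump(is_external_url('http://other.com/', 'domain.com'));        // bool(true)
var_dump(is_external_url('/about.html', 'domain.com'));              // bool(false), relative URL
```

One caveat: the lookahead only checks how the host begins, so something like http://domain.com.example.org/ would also count as internal. If that matters, the pattern would need a boundary after the domain.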
Suppose I want to extract all the links in an HTML document based on their text. How can I do it? For example: <a href="http://www.google.com">google</a>. I want to search for all links that have the text "google" and extract them.
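One possible approach, reusing the pattern from earlier in the thread: capture both the URL and the link text, then keep only the links whose text contains the search term. This is just a sketch; the function name get_links_by_text and the sample HTML are hypothetical, and the match is a case-insensitive substring test.

```php
<?php
// Hypothetical helper: return the URLs of links whose visible text contains $needle.
function get_links_by_text($html, $needle)
{
    $found   = array();
    $pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)[^>]*>(.*?)<\/a>/si';

    if (preg_match_all($pattern, $html, $links, PREG_SET_ORDER)) {
        foreach ($links as $link) {
            // $link[1] is the URL, $link[2] the inner HTML of the anchor.
            $text = trim(preg_replace('/\s+/', ' ', strip_tags($link[2])));

            if (stripos($text, $needle) !== false) {
                $found[] = $link[1];
            }
        }
    }

    // Drop duplicate URLs and reindex.
    return array_values(array_unique($found));
}

$html = '<a href="http://www.google.com">google</a> '
      . '<a href="http://example.com">something else</a> '
      . '<a href="http://maps.google.com"><em>Google</em> Maps</a>';

print_r(get_links_by_text($html, 'google'));
```

With this sample, the first and third URLs should be returned. For an exact match on the text rather than a substring, the stripos() test could be swapped for a strcasecmp() comparison.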